[Python-Dev] Split unicodeobject.c into subfiles

Victor Stinner

Oct 22, 2012, 8:50:32 PM

to Python Dev

Hi,

I forked CPython repository to work on my "split unicodeobject.c" project:
http://hg.python.org/sandbox/split-unicodeobject.c

The result is 10 files (including the existing unicodeobject.c):

1176 Objects/unicodecharmap.c
1678 Objects/unicodecodecs.c
1362 Objects/unicodeformat.c
253 Objects/unicodeimpl.h
733 Objects/unicodelegacy.c
1836 Objects/unicodenew.c
2777 Objects/unicodeobject.c
2421 Objects/unicodeoperators.c
1235 Objects/unicodeoscodecs.c
1288 Objects/unicodeutfcodecs.c
14759 total

This is just a proposal (and a work in progress). Everything can be changed :-)

"unicodenew.c" is not a good name. Content of this file may be moved
somewhere else.

Some files may be merged again if the separation is not justified.

I don't like the "unicode" prefix for filenames, I would prefer a new directory.

--

Shorter files are easier to review and maintain. The compilation is
faster if only one file is modified.

The MBCS codec requires windows.h. The whole unicodeobject.c includes
it just for this codec. With the split, only unicodeoscodecs.c
includes this file.

The MBCS codec also needs a "winver" variable. This variable is
defined between the BLOOM filter and the unicode_result_unchanged()
function. How can you explain how these things are sorted? Where
should I add a new function or variable? With the split, the variable
is now defined very close to where it is used. You don't have to
scroll 7000 lines to see where it is used.

If you would like to work on a specific function, you don't have to
use the search function of your editor to skip thousands of lines. For
example, the 18 functions and 2 types related to the charmap codec are
now grouped into one unique and short C file.

It was already possible to extend and maintain unicodeobject.c (some
people proved it!), but it should now be much simpler with shorter
files.

Note: unicodeobject.c also includes the huge stringlib library
(4000 lines), which is shared with the bytes type.

--

* Objects/unicodeimpl.h

Private macros and prototypes of private functions.

Many unicode_xxx() functions have been renamed to _PyUnicode_xxx() to
be able to reuse them in different files.

* Objects/unicodenew.c

Functions to create a new Unicode string (PyUnicode_New), convert
from/to UCS4 and wchar_t*, resize a string. The ugly part of the PEP
393.

* Objects/unicodeoperators.c

find, replace, compare, split, fill, etc.

* Objects/unicodeobject.c

"str" type with all methods, _string module and unicodeiter type.

* Objects/unicodeformat.c

PyUnicode_FromFormat() and PyUnicode_Format()

* Objects/unicodecodecs.c

Text codecs for Python Unicode strings:
- PyUnicode_Decode()
- PyUnicode_AsEncodedObject()
- PyUnicode_DecodeUnicodeEscape()
- PyUnicode_DecodeRawUnicodeEscape(), PyUnicode_AsRawUnicodeEscapeString()
- _PyUnicode_DecodeUnicodeInternal()
- PyUnicode_DecodeLatin1(), PyUnicode_AsLatin1String()
- PyUnicode_AsASCIIString()
- PyUnicode_EncodeDecimal()
- many helpers for other codecs
- ...

* Objects/unicodecharmap.c

Character Mapping Codec:
- PyUnicode_BuildEncodingMap()
- PyUnicode_DecodeCharmap()
- PyUnicode_AsCharmapString()
- PyUnicode_Translate()

* Objects/unicodeoscodecs.c

Operating system codecs: MBCS codec, locale (FS) codec => FS encode/decode.

* Objects/unicodeutfcodecs.c

UTF-7/8/16/32 codecs and ASCII decoder.

* Objects/unicodelegacy.c

Legacy and deprecated Unicode API: Py_UNICODE type.

Victor
_______________________________________________
Python-Dev mailing list
Pytho...@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/dev-python%2Bgarchive-30976%40googlegroups.com

Benjamin Peterson

Oct 23, 2012, 4:22:10 AM

to Victor Stinner, Python Dev

2012/10/22 Victor Stinner <victor....@gmail.com>:

I would like to repeat my opposition to splitting unicodeobject.c. I
don't think the benefits of such a split have been well justified,
certainly not to the point that the claim about "much simpler"
maintenance is true.

--
Regards,
Benjamin

M.-A. Lemburg

Oct 23, 2012, 5:28:39 AM

to Benjamin Peterson, Python Dev

Same feelings here.

If you do go ahead with such a split, please only split the source
files and keep the unicodeobject.c file which then includes all
the other files. Such a restructuring should not result in compilers
no longer being able to optimize code by inlining functions
in one of the most important basic types we have in Python 3.

Also note that splitting the file in multiple smaller ones will
actually create more maintenance overhead, since patches will
likely no longer be easy to merge from 3.3 to 3.4.

BTW: The positive effect of having everything in one file is
that you no longer have to figure out which files to look in when
trying to find a piece of logic... it's just a ctrl-f or
ctrl-s away :-)

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 23 2012)
>>> Python Projects, Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________
2012-09-27: Released eGenix PyRun 1.1.0 ... http://egenix.com/go35
2012-09-26: Released mxODBC.Connect 2.0.1 ... http://egenix.com/go34
2012-09-25: Released mxODBC 3.2.1 ... http://egenix.com/go33
2012-10-23: Python Meeting Duesseldorf ... today

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

Victor Stinner

Oct 23, 2012, 6:05:21 AM

to Python Dev

> Such a restructuring should not result in compilers
> no longer being able to optimize code by inlining functions
> in one of the most important basic types we have in Python 3.

I agree that performance is important. But I'm not convinced that
moving functions has a real impact on performance, nor that such
issues cannot be fixed.

I tried to limit changes impacting performance. Inlining is (only?)
interesting for short functions. PEP 393 introduces many macros for
this. I also added some "Fast" functions
(_PyUnicode_FastCopyCharacters() and _PyUnicode_FastFill()) which
don't check parameters and do the real work. I don't think that it's
really useful to inline _PyUnicode_FastFill() in the caller for
example.

I will check performances of all str methods. For example, str.count()
is now calling PyUnicode_Count() instead of the static count().
PyUnicode_Count() adds some extra checks, some of them are not
necessary, and it's not a static function, so it cannot(?) be inlined.
But I bet that the overhead is really low.
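
One way to run such a check is a micro-benchmark executed against both builds. The sketch below is an assumption, not the script used for this proposal: the helper names `time_str_count` and `relative_change` and the sample inputs are hypothetical, and only the standard `timeit` module is used. The measured time includes the small overhead of the wrapping lambda, which is the same for both builds.

```python
import timeit


def time_str_count(haystack, needle, repeat=5, number=1000):
    """Return the best observed time in seconds for one haystack.count(needle) call."""
    timer = timeit.Timer(lambda: haystack.count(needle))
    # Take the minimum over several runs to reduce noise from other processes.
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number


def relative_change(baseline_seconds, candidate_seconds):
    """Return the slowdown of candidate relative to baseline (0.10 means 10% slower)."""
    return (candidate_seconds - baseline_seconds) / baseline_seconds


if __name__ == "__main__":
    # Hypothetical inputs: an ASCII string and a string with non-Latin-1 characters.
    for label, text, needle in [
        ("ascii", "abc" * 1000, "b"),
        ("ucs2", "ab\u20ac" * 1000, "\u20ac"),
    ]:
        print(label, time_str_count(text, needle))
```

Running the same script with the interpreter built from each branch and feeding the two results to `relative_change` gives a rough figure for the overhead; a serious comparison would also need a quiet machine and identical compiler flags.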

Note: Since GCC 4.5, Link Time Optimization is possible. I don't know
if GCC is able to inline functions defined in different files, but C
compilers get better with each release.

--

I will check the performance impact on _PyUnicode_Widen() and
_PyUnicode_Putchar(), which are no longer static. _PyUnicode_Widen() and
_PyUnicode_Putchar() are used in Unicode codecs when it's more
expensive to compute the exact length and maximum character of the
output string. These functions are optimistic (they hope that the output
will not grow too much and that the string is not widened too many times,
so it should be faster for ASCII).

I implemented a similar approach in my PyUnicodeWriter API, and I plan
to reuse this API to simplify the API. PyUnicodeWriter uses some macros
to limit the overhead of having to check before each write whether we need
to enlarge or widen the internal buffer, and allows writing directly
into the buffer using low-level functions like PyUnicode_WRITE.

I also hope for a performance improvement because the PyUnicodeWriter API
can also overallocate the internal buffer to reduce the number of
calls to realloc() (which is usually slow).
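
The effect of overallocation on the number of realloc() calls can be illustrated with a small simulation. This is only a model written in Python, not the PyUnicodeWriter code: the function name `count_reallocations` and the 25% headroom policy are assumptions chosen for illustration.

```python
def count_reallocations(total_chars, overallocate):
    """Count buffer growth steps when appending one character at a time."""
    capacity = 0
    reallocations = 0
    for length in range(1, total_chars + 1):
        if length > capacity:
            # Hypothetical policy: exact fit, or 25% headroom when overallocating.
            headroom = length // 4 if overallocate else 0
            capacity = length + headroom
            reallocations += 1
    return reallocations


if __name__ == "__main__":
    print("exact fit:     ", count_reallocations(10000, overallocate=False))
    print("overallocating:", count_reallocations(10000, overallocate=True))
```

With an exact-fit policy every appended character triggers a growth step, while a proportional headroom makes the number of steps grow only logarithmically with the final length. The price is some unused memory until the buffer is trimmed at the end.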

> Also note that splitting the file in multiple smaller ones will
> actually create more maintenance overhead, since patches will
> likely no longer be easy to merge from 3.3 to 3.4.

I'm a candidate to maintain unicodeobject.c. If you check the (recent)
history of unicodeobject.c, I have been one of the most active developers
on this file for the past two years (especially in 2012). I'm not sure that
merges on this file are so hard.

Victor

Antoine Pitrou

Oct 23, 2012, 6:11:11 AM

to pytho...@python.org

On 23/10/2012 12:05, Victor Stinner wrote:
>> Such a restructuring should not result in compilers
>> no longer being able to optimize code by inlining functions
>> in one of the most important basic types we have in Python 3.
>
> I agree that performance is important. But I'm not convinced that
> moving functions has a real impact on performance, nor that such
> issues cannot be fixed.

I agree with Marc-André, there's no point in compiling those files
separately. #include'ing them in the master unicodeobject.c file is fine.

Regards

Antoine.

Amaury Forgeot d'Arc

Oct 23, 2012, 8:03:44 AM

to Antoine Pitrou, pytho...@python.org

2012/10/23 Antoine Pitrou <soli...@pitrou.net>:

> I agree with Marc-André, there's no point in compiling those files
> separately. #include'ing them in the master unicodeobject.c file is fine.

I also find the unicodeobject.c difficult to navigate.
Even if we don't split the file, I'd advocate a better presentation of
its content.

Could we have at least clear sections, with titles and descriptions?
And use the ^L page separator for Emacs users?

Code in posixmodule.c could also benefit from a better layout.

--
Amaury Forgeot d'Arc

Georg Brandl

Oct 23, 2012, 12:29:53 PM

to pytho...@python.org

I agree. I haven't edited much in unicodeobject.c lately, so this is
just an expression of my preference in general to keep things together.

We tell new Python programmers to stop worrying about using indentation
for grouping because editors are meant to make this easy. A similar
argument applies to navigating large files: with a decent editor there is
no real problem with large files.

I agree completely with suggestions to improve sectioning and/or comments
within the file.

But once you make any split, people will look for things in the wrong file.
It happens to me every time I look for something in either object.c or
abstract.c -- that's an instance where the function name prefix doesn't imply
the implementation file name, which is otherwise very clear and easy in the
Python sources.

Especially since you're suggesting a huge number of new files, I question the
argument of better navigability.

Georg

BTW:

> If you would like to work on a specific function, you don't have to
> use the search function of your editor to skip thousands of lines. For
> example, the 18 functions and 2 types related to the charmap codec are
> now grouped into one unique and short C file.

After opening the right file, I *still* use the search function to get to
the function I want to edit. Don't tell me using a scroll bar to scan
for the right place is faster...

Larry Hastings

Oct 24, 2012, 12:04:08 PM

to pytho...@python.org

On 10/23/2012 09:29 AM, Georg Brandl wrote:

> Especially since you're suggesting a huge number of new files, I question the
> argument of better navigability.

FWIW I'm -1 on it too. I don't see what the big deal is with "large" source files. If you have difficulty finding your way around unicodeobject.c, that seems more like a tooling issue to me, not a source code structural issue.

/arry

Nick Coghlan

Oct 24, 2012, 6:15:14 PM

to Larry Hastings, pytho...@python.org

OK, I need to weigh in after seeing this kind of reply. Large source files are discouraged in general because they're a code smell that points strongly towards a *lack of modularity* within a *complex piece of functionality*.

Breaking such files up into separately compiled modules serves two purposes:
1. It proves that the code *isn't* a tangled monolithic mess;
2. It enlists the compilation toolchain's assistance in ensuring that remains the case in the future.

I find complaints about the ease of searching within the file to be misguided and irrelevant, as I can just as easily reply with "if searching across multiple files is hard for you, use better tools, like grep, or 'Find in Files'".

Note that I also consider the "pro" argument about better navigability inaccurate - the real gain is in *modularity*, making it clear to readers which parts can be understood and worked on separately from each other.

We are not special snowflakes - good software engineering practice is advisable for us as well, so a big +1 from me for breaking up the monstrosity that is unicodeobject.c and lowering the barrier to entry for hacking on the individual pieces. This should come with a large block comment in unicodeobject.c explaining how the pieces are put back together again.

However, -1 on the "faux modularity" idea of breaking up the files on disk but still exposing them to the compiler and linker as a monolithic block. That would be completely missing the point of why large source files are bad.

Regards,
Nick.

--
Sent from my phone, thus the relative brevity :)


Barry Warsaw

Oct 24, 2012, 6:37:37 PM

to pytho...@python.org

On Oct 25, 2012, at 08:15 AM, Nick Coghlan wrote:

>OK, I need to weigh in after seeing this kind of reply. Large source files
>are discouraged in general because they're a code smell that points
>strongly towards a *lack of modularity* within a *complex piece of
>functionality*.

Modularity is good, and the file system structure of the project should
reflect that, but to be effective, it needs to be obvious. It's pretty
obvious what's generally in intobject.c. I've worked with code bases where
there's no rhyme nor reason as to what you'd find in a particular file, and
this really hurts.

It hurts even with good tools. Remember that sometimes you don't even know
what you're looking for, so search tools may not be very useful. For example,
sometimes you want to understand how all the pieces fit together, what the
holistic view of the subsystem is, or where the "entry points" are. Search
tools are not very good at this, and if it's a subsystem you only interact
with occasionally, having a file system organization that makes things easier
to remember what you learned the last time you were there helps enormously.

Another point: rather than large files (or maybe in addition to them), large
functions can also be painful to navigate. So just splitting a file into
subfiles may not be the only modularity improvement you can make.

While I'm personally -0 about splitting up unicodeobject.c, if the folks
advocating for it go ahead with it, I just ask that you do it very carefully,
with an eye toward the casual and newbie reader of our code base.

Cheers,
-Barry

Nick Coghlan

Oct 24, 2012, 8:03:31 PM

to Barry Warsaw, pytho...@python.org

On Thu, Oct 25, 2012 at 8:37 AM, Barry Warsaw <ba...@python.org> wrote:
> On Oct 25, 2012, at 08:15 AM, Nick Coghlan wrote:
>
>>OK, I need to weigh in after seeing this kind of reply. Large source files
>>are discouraged in general because they're a code smell that points
>>strongly towards a *lack of modularity* within a *complex piece of
>>functionality*.
>
> Modularity is good, and the file system structure of the project should
> reflect that, but to be effective, it needs to be obvious. It's pretty
> obvious what's generally in intobject.c. I've worked with code bases where
> there's no rhyme nor reason as to what you'd find in a particular file, and
> this really hurts.
>
> It hurts even with good tools. Remember that sometimes you don't even know
> what you're looking for, so search tools may not be very useful. For example,
> sometimes you want to understand how all the pieces fit together, what the
> holistic view of the subsystem is, or where the "entry points" are. Search
> tools are not very good at this, and if it's a subsystem you only interact
> with occasionally, having a file system organization that makes things easier
> to remember what you learned the last time you were there helps enormously.

And if we were talking in the abstract, I think these would be
reasonable concerns to bring up. However, Victor's proposed division
*is* logical (especially if he goes down the path of a separate
subdirectory which will better support easy searching across all of
the unicode object related files), and I conditioned my +1 with the
requirement that a road map be provided in a leading block comment in
unicodeobject.c.

speed.python.org is also making progress, and once that is up and
running (which will happen well before any Python 3.4 release) it will
be possible to compare the numbers between 3.3 and trunk to help
determine the validity of any concerns regarding optimisations that
can be performed within a module but not across modules.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Stephen J. Turnbull

Oct 25, 2012, 12:22:03 AM

to Nick Coghlan, pytho...@python.org

Nick Coghlan writes:

> OK, I need to weigh in after seeing this kind of reply. Large source files
> are discouraged in general because they're a code smell that points
> strongly towards a *lack of modularity* within a *complex piece of
> functionality*.

Sure, but large numbers of tiny source files are also a code smell,
the smell of purist adherence to the literal principle of modularity
without application of judgment.

If you want to argue that the pragmatic point of view nevertheless is
to break up the file, I can see that, but I think Victor is going too
far. (Full disclosure dept.: the call graph of the Emacs equivalents
is isomorphic to the Dungeon of Zork, so I may be a bit biased.) You
really should speak to the question of "how many" and "what partition".

> the real gain is in *modularity*, making it clear to readers which
> parts can be understood and worked on separately from each other.

Yeah, so which do you think they are? It seems to me that there are
three modules to be carved out of unicodeobject.c:

1. The internal object management that is not exposed to Python:
allocation, deallocation, and PEP 393 transformations.

2. The public interface to Python implementation: methods and
properties, including operators.

3. Interaction with the outside world: codec implementations. But
conceptually, these really don't have anything to do with internal
implementation of Unicode objects. They're just functions that
convert bytes to Unicode and vice versa. In principle they can be
written in terms of ord(), chr(), and bytes(). On the other hand,
they're rather repetitive: "When you've seen one codec
implementation, you've seen them all." I see no harm in grouping
them in one file, and possibly a gain from proximity: casual
passers-by might see refactorings that reduce redundancy.
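
The claim that codecs can, in principle, be written in terms of ord(), chr(), and bytes() can be illustrated with a minimal pure-Python sketch. Latin-1 is chosen here only because it maps code points 0-255 directly to byte values; the function names `encode_latin1` and `decode_latin1` are hypothetical and are not part of CPython.

```python
def encode_latin1(text):
    """Encode str to bytes using only ord() and bytes()."""
    code_points = [ord(ch) for ch in text]
    for index, code_point in enumerate(code_points):
        if code_point > 0xFF:
            # Mirror the built-in codec: characters above U+00FF are not encodable.
            raise UnicodeEncodeError(
                "latin-1", text, index, index + 1, "ordinal not in range(256)"
            )
    return bytes(code_points)


def decode_latin1(data):
    """Decode bytes to str using only chr(); every byte value is valid Latin-1."""
    return "".join(chr(byte) for byte in data)


if __name__ == "__main__":
    sample = "caf\u00e9"
    print(encode_latin1(sample), decode_latin1(encode_latin1(sample)))
```

Nothing in this sketch touches the internal representation of the string, which is the point being made: the C implementations exist for speed and bootstrapping, not because the conversion logic needs access to the object's internals.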

I'm not sure what to do with the charmap stuff. In current CPython
head it seems incoherent to me: there's an IO codec, but there's also
unicode-to-unicode stuff (PyUnicode_Translate). I haven't had time to
look at Victor's reorganization to see what he actually did with it,
but in terms of modularity, it seems to me that refactoring this stuff
would be a real win, as opposed to splitting the files, which is a
presentational improvement for the rest of the code, which is pretty
modular.
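
The two kinds of charmap functionality mentioned above can be seen side by side from Python. This is a hedged sketch: it uses the `codecs.charmap_build()`, `codecs.charmap_decode()` and `codecs.charmap_encode()` functions that the table-driven modules in the standard `encodings` package rely on, plus `str.translate()`. The table (Latin-1 with byte 0x80 remapped to the euro sign) and the helper names are hypothetical.

```python
import codecs

# Hypothetical 256-entry decoding table: Latin-1, except byte 0x80 maps to the euro sign.
_chars = [chr(i) for i in range(256)]
_chars[0x80] = "\u20ac"
DECODING_TABLE = "".join(_chars)
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def decode_bytes(data):
    """bytes -> str through the charmap codec (the I/O direction)."""
    text, _consumed = codecs.charmap_decode(data, "strict", DECODING_TABLE)
    return text


def encode_text(text):
    """str -> bytes through the charmap codec."""
    data, _consumed = codecs.charmap_encode(text, "strict", ENCODING_TABLE)
    return data


def strip_vowels(text):
    """str -> str mapping; no bytes are involved at all."""
    return text.translate({ord(vowel): None for vowel in "aeiou"})
```

The first two helpers convert between bytes and text, while the third stays entirely within text, which is the mismatch the paragraph above points at.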

As for Victor's proposal itself:

1176 Objects/unicodecharmap.c
1678 Objects/unicodecodecs.c
1362 Objects/unicodeformat.c
253 Objects/unicodeimpl.h
733 Objects/unicodelegacy.c
1836 Objects/unicodenew.c
2777 Objects/unicodeobject.c
2421 Objects/unicodeoperators.c
1235 Objects/unicodeoscodecs.c
1288 Objects/unicodeutfcodecs.c

As Victor himself admits, "unicodelegacy" and "unicodenew" are not
descriptive of what they contain. In I18N discussions, "legacy" is
usually a deprecatory reference to non-Unicode encodings, and I would
tend to guess this file contains codecs from the name. A better name
might be "unicodedeprecated" (if what he really means is deprecated
APIs).

I don't understand why splitting out "unicodeoperators" is a great
idea; it's done nowhere else in CPython. If that makes sense, why not
split out "unicodemethods" (for methods normally invoked explicitly
rather than by syntax) too? N.B. For bytes, the corresponding file is
spelled "bytes_methods".

"unicodecodecs" vs "unicodeutfcodecs": Say what? I would forever be
looking in the wrong one.

"unicodeoscodecs" suggests to me that these codecs are only usable on
some OSes. If so, shouldn't the relevant OS be in the name? If not,
the name is basically misleading IMO.

Why are any of these codecs here in unicodeobjectland in the first
place? Sure, they're needed so that Python can find its own stuff,
but in principle *any* codec could be needed. Is it just an heuristic
that the codecs needed for 99% of the world are here, and other codecs
live in separate modules?

Steve

Nick Coghlan

Oct 25, 2012, 2:42:55 AM

to Stephen J. Turnbull, pytho...@python.org

On Thu, Oct 25, 2012 at 2:22 PM, Stephen J. Turnbull <ste...@xemacs.org> wrote:
> Nick Coghlan writes:
>
> > OK, I need to weigh in after seeing this kind of reply. Large source files
> > are discouraged in general because they're a code smell that points
> > strongly towards a *lack of modularity* within a *complex piece of
> > functionality*.
>
> Sure, but large numbers of tiny source files are also a code smell,
> the smell of purist adherence to the literal principle of modularity
> without application of judgment.

Absolutely. The classic example of this is Java's unfortunate
insistence on only-one-public-top-level-class-per-file. Bleh.

> If you want to argue that the pragmatic point of view nevertheless is
> to break up the file, I can see that, but I think Victor is going too
> far. (Full disclosure dept.: the call graph of the Emacs equivalents
> is isomorphic to the Dungeon of Zork, so I may be a bit biased.) You
> really should speak to the question of "how many" and "what partition".

Yes, I agree I was too hasty in calling the specifics of Victor's
current proposal a good idea. What raised my ire was the raft of
replies objecting to the refactoring *in principle* for completely
specious reasons like being able to search within a single file
instead of having to use tools that can search across multiple files.

unicodeobject.c is too big, and should be restructured to make any
natural modularity explicit, and provide an easier path for users that
want to understand how the unicode implementation works.

> > the real gain is in *modularity*, making it clear to readers which
> > parts can be understood and worked on separately from each other.
>
> Yeah, so which do you think they are? It seems to me that there are
> three modules to be carved out of unicodeobject.c:
>
> 1. The internal object management that is not exposed to Python:
> allocation, deallocation, and PEP 393 transformations.
>
> 2. The public interface to Python implementation: methods and
> properties, including operators.
>
> 3. Interaction with the outside world: codec implementations. But
> conceptually, these really don't have anything to do with internal
> implementation of Unicode objects. They're just functions that
> convert bytes to Unicode and vice versa. In principle they can be
> written in terms of ord(), chr(), and bytes(). On the other hand,
> they're rather repetitive: "When you've seen one codec
> implementation, you've seen them all." I see no harm in grouping
> them in one file, and possibly a gain from proximity: casual
> passers-by might see refactorings that reduce redundancy.

I suspect you and Victor are in a much better position to thrash out
the details than I am. It was the trend in the discussion to treat the
question as "split or don't split?" rather than "how should we split
it?" when a file that large should already contain some natural
splitting points if the implementation isn't a tangled monolithic
mess.

> Why are any of these codecs here in unicodeobjectland in the first
> place? Sure, they're needed so that Python can find its own stuff,
> but in principle *any* codec could be needed. Is it just an heuristic
> that the codecs needed for 99% of the world are here, and other codecs
> live in separate modules?

I believe it's a combination of history and whether or not they're
needed by the interpreter during the bootstrapping process before the
encodings namespace is importable.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

M.-A. Lemburg

Oct 25, 2012, 2:57:19 AM

to Nick Coghlan, Stephen J. Turnbull, pytho...@python.org

On 25.10.2012 08:42, Nick Coghlan wrote:
>> Why are any of these codecs here in unicodeobjectland in the first
>> place? Sure, they're needed so that Python can find its own stuff,
>> but in principle *any* codec could be needed. Is it just an heuristic
>> that the codecs needed for 99% of the world are here, and other codecs
>> live in separate modules?
>
> I believe it's a combination of history and whether or not they're
> needed by the interpreter during the bootstrapping process before the
> encodings namespace is importable.

They are in unicodeobject.c so that the compilers can inline the
code in the various other places where they are used in the Unicode
implementation directly as necessary and because the codecs use
a lot of functions from the Unicode API (obviously), so the other
direction of inlining (Unicode API in codecs) is needed as well.

BTW: When discussing compiler optimizations, please remember that
there are more compilers out there than just GCC and also the fact
that not everyone is using the latest and greatest version of it.
Link time inlining will usually not be as efficient as compile time
optimization and we need every bit of performance we can get
for Unicode in Python 3.

--
Marc-Andre Lemburg
eGenix.com

M.-A. Lemburg

Oct 25, 2012, 3:08:49 AM

to Nick Coghlan, Stephen J. Turnbull, pytho...@python.org

On 25.10.2012 08:42, Nick Coghlan wrote:

> unicodeobject.c is too big, and should be restructured to make any
> natural modularity explicit, and provide an easier path for users that
> want to understand how the unicode implementation works.

You can also achieve that goal by structuring the code in unicodeobject.c
in a more modular way. It is already structured in sections, but
there's always room for improvement, of course.

As mentioned before, it is possible to split out various sections
into separate .c or .h files which then get included in the main
unicodeobject.c. If that's where consensus is going, I'm with Stephen
here in that such a separation should be done in higher level
chunks, rather than creating >10 new files.

--
Marc-Andre Lemburg
eGenix.com

Maciej Fijalkowski

Oct 25, 2012, 5:18:53 AM

to M.-A. Lemburg, Stephen J. Turnbull, Nick Coghlan, pytho...@python.org

On Thu, Oct 25, 2012 at 8:57 AM, M.-A. Lemburg <m...@egenix.com> wrote:
> On 25.10.2012 08:42, Nick Coghlan wrote:
>>> Why are any of these codecs here in unicodeobjectland in the first
>>> place? Sure, they're needed so that Python can find its own stuff,
>>> but in principle *any* codec could be needed. Is it just an heuristic
>>> that the codecs needed for 99% of the world are here, and other codecs
>>> live in separate modules?
>>
>> I believe it's a combination of history and whether or not they're
>> needed by the interpreter during the bootstrapping process before the
>> encodings namespace is importable.
>
> They are in unicodeobject.c so that the compilers can inline the
> code in the various other places where they are used in the Unicode
> implementation directly as necessary and because the codecs use
> a lot of functions from the Unicode API (obviously), so the other
> direction of inlining (Unicode API in codecs) is needed as well.

I'm sorry to interrupt, but have you actually measured? What effect
the lack of said inlining has on *any* benchmark is definitely beyond
my ability to guess and I suspect is beyond the ability to guess of
anyone else on this list.

I challenge you to find a benchmark that is being significantly
affected (>15%) with the split proposed by Victor. It does not even
have to be a real-world one, although that would definitely buy it
more credibility.

Cheers,
fijal

M.-A. Lemburg

Oct 25, 2012, 5:49:48 AM

to Maciej Fijalkowski, Stephen J. Turnbull, Nick Coghlan, pytho...@python.org

On 25.10.2012 11:18, Maciej Fijalkowski wrote:
> On Thu, Oct 25, 2012 at 8:57 AM, M.-A. Lemburg <m...@egenix.com> wrote:
>> On 25.10.2012 08:42, Nick Coghlan wrote:
>>>> Why are any of these codecs here in unicodeobjectland in the first
>>>> place? Sure, they're needed so that Python can find its own stuff,
>>>> but in principle *any* codec could be needed. Is it just an heuristic
>>>> that the codecs needed for 99% of the world are here, and other codecs
>>>> live in separate modules?
>>>
>>> I believe it's a combination of history and whether or not they're
>>> needed by the interpreter during the bootstrapping process before the
>>> encodings namespace is importable.
>>
>> They are in unicodeobject.c so that the compilers can inline the
>> code in the various other places where they are used in the Unicode
>> implementation directly as necessary and because the codecs use
>> a lot of functions from the Unicode API (obviously), so the other
>> direction of inlining (Unicode API in codecs) is needed as well.
>
> I'm sorry to interrupt, but have you actually measured? What effect
> the lack of said inlining has on *any* benchmark is definitely beyond
> my ability to guess and I suspect is beyond the ability to guess of
> anyone else on this list.
>
> I challenge you to find a benchmark that is being significantly
> affected (>15%) with the split proposed by Victor. It does not even
> have to be a real-world one, although that would definitely buy it
> more credibility.

I think you misunderstood. What I described is the reason for having
the base codecs in unicodeobject.c.

I think we all agree that inlining has a positive effect on
performance. The scale of the effect depends on the compiler
and platform used.

Victor already mentioned that he'll check the impact of his
proposal, so let's wait for that.

--
Marc-Andre Lemburg
eGenix.com

Serhiy Storchaka

Oct 25, 2012, 5:49:57 AM10/25/12

to pytho...@python.org

On 25.10.12 12:18, Maciej Fijalkowski wrote:
> I challenge you to find a benchmark that is being significantly
> affected (>15%) with the split proposed by Victor. It does not even
> have to be a real-world one, although that would definitely buy it
> more credibility.

I see a 10% slowdown for UTF-8 decoding of UCS2 strings, but a 10% speedup
for mostly-BMP UCS4 strings. For encoding the situation is reversed
(but up to +27%). Charmap decoding speeds up by 10-30%.

GCC 4.4.3, 32-bit Linux.

https://bitbucket.org/storchaka/cpython-stuff/src/default/bench
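
A micro-benchmark along these lines can be sketched with the standard
timeit module. This is not the script from the repository above: the
sample strings, their sizes, the choice of cp1252 as a charmap-based
codec, and the helper names bench() and run() are all assumptions made
for illustration. The idea is to run the same file under both builds
and compare the numbers.

```python
import timeit

# Sample texts chosen so that PEP 393 picks different internal widths.
# These inputs are assumptions, not taken from the linked benchmark suite.
SAMPLES = {
    "ascii": "a" * 10000,
    "ucs2": "\u20ac" * 10000,                           # BMP only
    "mostly-bmp ucs4": "\u20ac" * 9999 + "\U0001F600",  # one astral character
}


def bench(func, repeat=5, number=200):
    """Return the best per-call time in seconds for func()."""
    timer = timeit.Timer(func)
    return min(timer.repeat(repeat=repeat, number=number)) / number


def run():
    """Time UTF-8 encode/decode per sample, plus one charmap decode."""
    results = {}
    for name, text in SAMPLES.items():
        data = text.encode("utf-8")
        # The lambdas are called inside this iteration, so they see the
        # current values of data and text.
        results[name, "utf-8 decode"] = bench(lambda: data.decode("utf-8"))
        results[name, "utf-8 encode"] = bench(lambda: text.encode("utf-8"))
    # cp1252 is one of the stdlib codecs built on the charmap machinery.
    raw = bytes(range(32, 127)) * 100
    results["printable bytes", "charmap decode"] = bench(
        lambda: raw.decode("cp1252"))
    return results


if __name__ == "__main__":
    for (name, op), seconds in sorted(run().items()):
        print(f"{name:16} {op:15} {seconds * 1e6:8.2f} us")
```

Taking the minimum over several repeats reduces noise from other
processes; differences of a few percent between two runs should still
be treated with caution.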

Maciej Fijalkowski

Oct 25, 2012, 6:07:41 AM10/25/12

to M.-A. Lemburg, Stephen J. Turnbull, Nick Coghlan, pytho...@python.org

>
> I think you misunderstood. What I described is the reason for having
> the base codecs in unicodeobject.c.
>
> I think we all agree that inlining has a positive effect on
> performance. The scale of the effect depends on the compiler
> and platform used.
>

Well. Inlining can have positive or negative effects, depending on
various details. Too much inlining causes more cache misses for
example. However, this is absolutely irrelevant if you don't create
benchmarks and run them. Guessing is seriously not a very good
optimization strategy.

Cheers,
fijal

Serhiy Storchaka

Oct 25, 2012, 6:09:21 AM10/25/12

to pytho...@python.org

On 25.10.12 12:49, M.-A. Lemburg wrote:
> I think you misunderstood. What I described is the reason for having
> the base codecs in unicodeobject.c.

For example, PyUnicode_FromStringAndSize and PyUnicode_FromString are
thin wrappers around PyUnicode_DecodeUTF8Stateful. I think this is a
reason to keep these functions together.
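
The relationship can be observed from Python through ctypes, at least
at the level of behaviour. The sketch below does not show the C
implementation; it only checks that the three entry points return the
same str for the same UTF-8 bytes. It assumes CPython (ctypes.pythonapi
is CPython-specific), and the helper name decode_three_ways() is
invented for this example.

```python
import ctypes

api = ctypes.pythonapi  # CPython only; calls are made with the GIL held

api.PyUnicode_FromString.argtypes = [ctypes.c_char_p]
api.PyUnicode_FromString.restype = ctypes.py_object

api.PyUnicode_FromStringAndSize.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
api.PyUnicode_FromStringAndSize.restype = ctypes.py_object

api.PyUnicode_DecodeUTF8Stateful.argtypes = [
    ctypes.c_char_p,                   # s
    ctypes.c_ssize_t,                  # size
    ctypes.c_char_p,                   # errors (NULL means "strict")
    ctypes.POINTER(ctypes.c_ssize_t),  # consumed (NULL: not stateful)
]
api.PyUnicode_DecodeUTF8Stateful.restype = ctypes.py_object


def decode_three_ways(data):
    """Decode UTF-8 bytes (without embedded NULs) via three C entry points."""
    return (
        api.PyUnicode_FromString(data),
        api.PyUnicode_FromStringAndSize(data, len(data)),
        api.PyUnicode_DecodeUTF8Stateful(data, len(data), None, None),
    )


if __name__ == "__main__":
    a, b, c = decode_three_ways("h\u00e9llo \u20ac".encode("utf-8"))
    print(a == b == c)
```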

Nick Coghlan

Oct 25, 2012, 7:11:18 AM10/25/12

to Maciej Fijalkowski, Stephen J. Turnbull, pytho...@python.org, M.-A. Lemburg

On Thu, Oct 25, 2012 at 8:07 PM, Maciej Fijalkowski <fij...@gmail.com> wrote:
>>
>> I think you misunderstood. What I described is the reason for having
>> the base codecs in unicodeobject.c.
>>
>> I think we all agree that inlining has a positive effect on
>> performance. The scale of the effect depends on the compiler
>> and platform used.
>>
>
> Well. Inlining can have positive or negative effects, depending on
> various details. Too much inlining causes more cache misses for
> example. However, this is absolutely irrelevant if you don't create
> benchmarks and run them. Guessing is seriously not a very good
> optimization strategy.

Yep, that's why I made the point that speed.python.org should be a
going concern well before the 3.4 release, and will be able to let us
know if we have a problem relative to 3.3.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Antoine Pitrou

Oct 25, 2012, 10:52:55 AM10/25/12

to pytho...@python.org

Le 25/10/2012 02:03, Nick Coghlan a écrit :
>
> speed.python.org is also making progress, and once that is up and
> running (which will happen well before any Python 3.4 release) it will
> be possible to compare the numbers between 3.3 and trunk to help
> determine the validity of any concerns regarding optimisations that
> can be performed within a module but not across modules.

Nobody needs speed.python.org to run benchmarks before and after a
specific change, though. Cloning http://hg.python.org/benchmarks and
using the perf.py runner is everything that is needed.

Moreover, you would want to run benchmarks *before* committing and
pushing the changes. We don't want the huge splitting to be recorded and
then backed out in the repository history.

Regards

Antoine.
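
For a quick before/after check of a single operation, the same idea
can be sketched without the benchmark suite by running one timing
snippet under two interpreter builds. This is not perf.py and does not
use its options; the snippet being timed and the helper names
best_time() and compare() are assumptions made for illustration.

```python
import subprocess
import sys

# Small script executed inside each interpreter; it prints the best
# time per call in seconds. The workload is an arbitrary example.
SNIPPET = r"""
import timeit
data = ("\u20ac" * 10000).encode("utf-8")
best = min(timeit.repeat(lambda: data.decode("utf-8"), repeat=5, number=200))
print(best / 200)
"""


def best_time(python_exe):
    """Run SNIPPET under python_exe and return the time it reports."""
    result = subprocess.run(
        [python_exe, "-c", SNIPPET],
        check=True, capture_output=True, text=True,
    )
    return float(result.stdout.strip())


def compare(baseline_exe, patched_exe):
    """Return patched/baseline time ratio (> 1.0 means the patch is slower)."""
    return best_time(patched_exe) / best_time(baseline_exe)


if __name__ == "__main__":
    # Hypothetical usage: python compare.py ./python-before ./python-after
    # Without arguments, the running interpreter is compared with itself.
    if len(sys.argv) >= 3:
        baseline, patched = sys.argv[1:3]
    else:
        baseline = patched = sys.executable
    print(f"patched/baseline = {compare(baseline, patched):.2f}")
```

Both builds should be compiled with the same compiler and flags, since
otherwise the ratio measures the build configuration rather than the
change.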

Antoine Pitrou

Oct 25, 2012, 10:56:11 AM10/25/12

to pytho...@python.org

Le 25/10/2012 00:15, Nick Coghlan a écrit :
>
> However, -1 on the "faux modularity" idea of breaking up the files on
> disk, but still exposing them to the compiler and linker as a monolithic
> block, though. That would be completely missing the point of why large
> source files are bad.

I disagree with you. Source files are meant to be read by humans; we
don't really care whether the compiler has a modular view of the source
code.

Regards

Antoine.

_______________________________________________
Python-Dev mailing list
Pytho...@python.org
http://mail.python.org/mailman/listinfo/python-dev

Unsubscribe: http://mail.python.org/mailman/options/python-dev/dev-python%2Bgarchive-30976%40googlegroups.com

Larry Hastings

Oct 25, 2012, 11:13:53 AM10/25/12

to Nick Coghlan, pytho...@python.org

On 10/24/2012 03:15 PM, Nick Coghlan wrote:

> Breaking such files up into separately compiled modules serves two purposes:
>
> 1. It proves that the code *isn't* a tangled monolithic mess;
> 2. It enlists the compilation toolchain's assistance in ensuring that remains the case in the future.

Either the code is a "tangled monolithic mess" or it isn't. If it is, then let's fix that, regardless of the size of the file. If it isn't, I don't see breaking up the code among multiple files as providing any benefit. And I see no need for the toolchain's assistance to help us do something without benefit. The line count of the file is essentially unrelated to its inherent quality / maintainability.

> We are not special snow flakes - good software engineering practice is advisable for us as well, so a big +1 from me for breaking up the monstrosity that is unicodeobject.c and lowering the barrier to entry for hacking on the individual pieces. This should come with a large block comment in unicodeobject.c explaining how the pieces are put back together again.

I'm all for good software engineering practice. But can you cite objective reasons why large source files are provably bad? Not "tangled monolithic messes", not poorly-factored code. I agree that those are bad--but so far nobody has proposed that either of those is true about unicodeobject.c (unless you are implicitly doing so above), nor have they proposed credible remedies. All I've seen is that unicodeobject.c is a large file, and some people want to break it up into smaller files. I have yet to see anything but handwaving as justification. For example, what is this barrier to entry you suggest exists to hacking on the str object, that will apparently be dispelled simply by splitting one file into multiple files?

Someone proposed breaking up unicodeobject.c into three distinct subsystems and putting those in separate files. I still don't agree. It seems natural to me to have everything associated with the str object in one file, just as we do with every other object I can think of. If this were a genuinely good idea, we should consider doing it with every similar object. But nobody is proposing that. My guess is because the other files in CPython are "small enough". At which point we're right back to the primary motivation simply being the line count of unicodeobject.c, as a purely aesthetic and subjective judgment.

/arry

Antoine Pitrou

Oct 25, 2012, 5:39:19 PM10/25/12

to pytho...@python.org

On Thu, 25 Oct 2012 08:13:53 -0700
Larry Hastings <la...@hastings.org> wrote:
>
> I'm all for good software engineering practice. But can you cite
> objective reasons why large source files are provably bad? Not "tangled
> monolithic messes", not poorly-factored code. I agree that those are
> bad--but so far nobody has proposed that either of those is true about
> unicodeobject.c (unless you are implicitly doing so above)

Well, "tangled monolithic mess" is quite true about unicodeobject.c,
IMO.
Seriously, I agree with Victor: navigating around unicodeobject.c is a
PITA. Perhaps it isn't if you are using emacs, or you have 35 fingers,
or just a lot of spare time, but in my experience it's painful.

Regards

Antoine.

Stephen J. Turnbull

Oct 25, 2012, 10:35:38 PM10/25/12

to Antoine Pitrou, pytho...@python.org

Antoine Pitrou writes:

> Well, "tangled monolithic mess" is quite true about unicodeobject.c,
> IMO.

s/object.c// and your point remains valid. Just reading the table of
contents for UTR#17 (http://www.unicode.org/reports/tr17/) should
convince you that it's not going to be easy to produce an elegant
implementation!

> Seriously, I agree with Victor: navigating around unicodeobject.c is a
> PITA. Perhaps it isn't if you are using emacs, or you have 35 fingers,
> or just a lot of spare time, but in my experience it's painful.

Sure, but I don't know of a Unicode implementation which isn't.

I don't think that having a unicode/*.[ch] with a dozen files
(including the README etc.) in it is going to make it much more
navigable. If there are too many files, it's going to be a PITA to
maintain because there won't be an obvious place to put certain
functions. E.g., I've already mentioned my suspicions about the charmap
code (I apologize for not reading Victor's code to confirm them).

I don't object in principle to splitting unicodeobject.c. At the
very least, with all due respect to MAL, XEmacs experience with coding
systems (the Emacs equivalent of Python codecs) suggests that there is
very little to be lost by moving the codec implementations to a
separate file from the Unicode object implementation. (Here I'm
talking about codecs in the narrow sense of wire-format to Python 3 str
and back, not the more general Python 2 sense that included zip and
base64 and so on. I.e., PyUnicode_Translate is not a codec in the
relevant sense.)

On the other hand, I wouldn't be surprised if (despite my earlier
suggestion) codecs and unicode object internals need a close
relationship. (My intuition and sense of style says splitting codecs
from the low level memory management and PEP 393 stuff is a good idea,
but I'm not confident it would have no impact on performance.)
