Ned Batchelder's hyphenate

doug.na...@gmail.com

Jul 9, 2007, 10:30:20 PM
to Django developers
Ned just posted the code for the tabblo hyphenate filter in the public
domain. This should be added as a builtin django filter with proper
attribution. I don't think wordwrap should use it by default, and
optional arguments don't work. I was thinking of just calling it
'hyphenate' or 'hyphenatedwordwrap'.

http://www.nedbatchelder.com/code/modules/hyphenate.html

Thoughts?

Jacob Kaplan-Moss

Jul 9, 2007, 10:48:18 PM
to django-d...@googlegroups.com
On 7/9/07, doug.na...@gmail.com <doug.na...@gmail.com> wrote:
> Ned just posted the code for the tabblo hyphenate filter in the public
> domain.
[snip]
> Thoughts?

Maybe an addition to django.contrib.humanize?

Jacob

doug.na...@gmail.com

Jul 9, 2007, 10:54:50 PM
to Django developers
On further reflection, there is a huge internationalization issue
here. The hyphenation rules and data driven exceptions are English
specific. Some will work (minimally) for other languages, but are not
good enough. Proper integration will be required, and language
developers will need to have more knowledge about this corner domain.
Due to my NDA/NC I cannot work on that part of it, but I do have a
patch almost ready for django.utils.text.wordwrap to take an optional
boolean argument to do word hyphenation.

Thankfully it is data driven and getting the data from the .po should
not be too difficult. The problem will be getting the initial data.
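
To make the idea concrete, here is a rough sketch of a word wrapper with optional hyphenation. It is not the patch itself: `wrap_words`, its `split_word` callback, and `toy_split` are hypothetical names for illustration, and a callback is used instead of a boolean flag so the example stays self-contained.

```python
def wrap_words(text, width, split_word=None):
    """Greedy word wrap. If split_word is given, it is called on a word
    that does not fit and must return the word's hyphenation pieces."""
    lines = []
    current = ""
    for word in text.split():
        while True:
            candidate = word if not current else current + " " + word
            if len(candidate) <= width:
                current = candidate
                break
            if split_word is not None:
                pieces = split_word(word)
                # Room left on this line, reserving one column for the hyphen.
                room = width - len(current) - (1 if current else 0) - 1
                keep = 0
                for i in range(1, len(pieces)):
                    prefix = "".join(pieces[:i])
                    if prefix and len(prefix) <= room:
                        keep = i
                if keep:
                    head = "".join(pieces[:keep]) + "-"
                    lines.append(current + " " + head if current else head)
                    current = ""
                    # Simplification: the remainder is re-split on the next pass.
                    word = "".join(pieces[keep:])
                    continue
            if current:
                # Nothing of the word fits; close the line and retry.
                lines.append(current)
                current = ""
                continue
            # The word alone is wider than the line and cannot be split.
            lines.append(word)
            break
    if current:
        lines.append(current)
    return "\n".join(lines)


def toy_split(word):
    # Stand-in for a real hyphenator; it knows exactly one word.
    return ["hy", "phen", "ation"] if word == "hyphenation" else [word]


print(wrap_words("use hyphenation here", 10, toy_split))
```

Without the callback the long word is simply moved to its own line, which is what a plain word wrap would do.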

Todd O'Bryan

Jul 9, 2007, 11:07:43 PM
to django-d...@googlegroups.com
On Tue, 2007-07-10 at 02:54 +0000, doug.na...@gmail.com wrote:
> On further reflection, there is a huge internationalization issue
> here. The hyphenation rules and data driven exceptions are English
> specific. Some will work (minimally) for other languages, but are not
> good enough. Proper integration will be required, and language
> developers will need to have more knowledge about this corner domain.
> Due to my NDA/NC I cannot work on that part of it, but I do have a
> patch almost ready for django.utils.text.wordwrap to take an optional
> boolean argument to do word hyphenation.

I seem to remember that Knuth did a pretty amazing job with hyphenation
in TeX, most of it was algorithmic, and there were hyphenation engines
for at least a few languages.

I'll have to dig out my copy of The TeXbook to look (yes, I'm one of
those nerds who has the five-volume box set of TeX and MetaFont books),
but this may be something that somebody's already done a really good job
on.

Todd

doug.na...@gmail.com

Jul 9, 2007, 11:28:14 PM
to Django developers

On Jul 9, 11:07 pm, "Todd O'Bryan" <toddobr...@mac.com> wrote:
> I seem to remember that Knuth did a pretty amazing job with hyphenation
> in TeX, most of it was algorithmic, and there were hyphenation engines
> for at least a few languages.

Ned's implementation is taken directly from this, and I hope you are
correct. The code Ned put up contains data from the public domain and
was most likely restricted due to that.

Tom Tobin

Jul 9, 2007, 11:59:41 PM
to django-d...@googlegroups.com
> The code Ned put up contains data from the public domain and
> was most likely restricted due to that.

I'm not sure what you mean by this; "public domain" means anyone can
do pretty much whatever they want with it, without restriction.

doug.na...@gmail.com

Jul 10, 2007, 12:22:50 AM
to Django developers

On Jul 9, 11:59 pm, "Tom Tobin" <korp...@korpios.com> wrote:
> I'm not sure what you mean by this; "public domain" means anyone can
> do pretty much whatever they want with it, without restriction.

I mean he wanted his code in the public domain with working data so
that restricted him to data which was also in the public domain, and
not LaTeX data under a BSD or other license. NOTE: this is just me
making wild assumptions at this point.


doug.na...@gmail.com

Jul 10, 2007, 12:50:13 AM
to Django developers
Ticket with initial patch made: http://code.djangoproject.com/ticket/4821

It still needs documentation and unit testing (and
internationalization) but it is a start. Will try to get to the doc
and test this weekend.

-Doug

Russell Keith-Magee

Jul 10, 2007, 2:11:58 AM
to django-d...@googlegroups.com
On 7/10/07, doug.na...@gmail.com <doug.na...@gmail.com> wrote:
>
> Ticket with initial patch made: http://code.djangoproject.com/ticket/4821
>
> It still needs documentation and unit testing (and
> internationalization) but it is a start. Will try to get to the doc
> and test this weekend.

+1 from me.

However, like Jacob said, I think it should be in contrib.humanize,
not utils.text.

Yours,
Russ %-)

Ned Batchelder

Jul 10, 2007, 7:07:35 AM
to django-d...@googlegroups.com
Todd, good to meet a fellow nerd: I also have the five-volume hardcover set. My code is implemented from appendix H of volume 1 (or is it volume A?).

--Ned.
-- 
Ned Batchelder, http://nedbatchelder.com

Ned Batchelder

Jul 10, 2007, 7:13:53 AM
to django-d...@googlegroups.com
Since the algorithm is identical to the one used by TeX, the hyphenation data can be taken from there as well. I used a TeX distribution to get the latest patterns for English to include in the module. I installed MiKTeX and dug around in the tex/generic/hyphen directory to find them. There are also French and German patterns in that distro, and there may be other hyphenation data sets in other repositories on the web; I haven't looked.

--Ned.
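
For readers who have not seen the TeX approach, here is a minimal Python sketch of pattern-based hyphenation along the lines Ned describes. It is not Ned's module: the names `build_patterns` and `hyphenate` and the `min_left`/`min_right` parameters are made up for illustration, and `TOY_PATTERNS` is a tiny hand-picked set that only covers the example word. Real pattern files contain thousands of entries.

```python
import re

# Toy pattern set for illustration only. Digits are scores for the gap
# at that position; a "." in a pattern would mark a word boundary.
TOY_PATTERNS = "hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n"


def build_patterns(pattern_text):
    """Map each pattern's letters to its list of per-gap scores."""
    table = {}
    for pattern in pattern_text.split():
        letters = re.sub(r"[0-9]", "", pattern)
        # Splitting on the letters leaves one (possibly empty) digit
        # string per gap: len(letters) + 1 entries in total.
        scores = [int(d) if d else 0 for d in re.split(r"[.a-z]", pattern)]
        table[letters] = scores
    return table


def hyphenate(word, table, min_left=2, min_right=2):
    """Return the word split into pieces at the permitted break points."""
    work = "." + word.lower() + "."
    # points[i] is the score of the gap just before work[i].
    points = [0] * (len(work) + 1)
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            scores = table.get(work[start:end])
            if scores:
                for offset, score in enumerate(scores):
                    position = start + offset
                    points[position] = max(points[position], score)
    pieces = []
    last = 0
    # An odd final score allows a break; even (or zero) forbids it.
    for k in range(min_left, len(word) - min_right + 1):
        if points[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces


table = build_patterns(TOY_PATTERNS)
print("-".join(hyphenate("hyphenation", table)))
```

Every substring of the word is looked up, the highest score wins at each gap, and only odd scores permit a break. Supporting another language means loading a different pattern set; the code stays the same.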

doug.na...@gmail.com

Jul 10, 2007, 2:26:34 PM
to Django developers

On Jul 9, 10:48 pm, "Jacob Kaplan-Moss" <jacob.kaplanm...@gmail.com>
wrote:

> Maybe an addition to django.contrib.humanize?

If we decide to only support English, then I am fine with including
this as part of django.contrib.humanize.
If we decide to properly internationalize humanize, then I am fine
with that as well. (you don't use commas in German, you use periods
for instance).

There are four reasons why I feel it is better to have this as part of
the core:
1. Hyphenation is a media standard and crucial for non-html templates.
Sites which want to generate printable PDFs of, say, conference
programs, or in a standard news media style will want this as much as
they want pluralize, widthratio, rjust, and center. This is more than
a template filter, but is a text utility.

2. reduce duplication of code and confusion
The actual code being duplicated is extremely minimal, but having two
text wrappers in very different locations is confusing to both
developers and users. For template filters, it would be better to have
them documented together.

3. Internationalization
To properly implement this we need to integrate with the
internationalization code and have the core language developers help
with maintaining the hyphenation rules. It does not feel DRY to have a
separate internationalization system in humanize, and it does not seem
right to have sections of the core only used by a contrib module
(though this is done for admin).

In the end if those wiser than I decide it should be in humanize I
have no problem changing the patch and writing up the doc and unit
tests. I will need help with the internationalization parts. I do not
have enough experience with the i18n system to make the proper
architectural decisions. For the translated text, the wrapping should
use the locale middleware specified hyphenation rules. For text which
has not been translated, it should use the native LANGUAGE_CODE rules.
Not sure how to get that working.

-Doug
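
One possible shape for the language selection described above, as a sketch only. `AVAILABLE` and `pick_pattern_language` are hypothetical names, and the pattern sets themselves are left out. In a Django project the two language codes would presumably come from `django.utils.translation.get_language()` and `settings.LANGUAGE_CODE`; here they are passed in as plain strings so the example runs on its own.

```python
# Hypothetical: the languages for which hyphenation patterns exist.
AVAILABLE = {"en", "de", "fr"}


def pick_pattern_language(text_is_translated, active_language,
                          default_language, available=AVAILABLE):
    """Choose which pattern set to use for a piece of text.

    Translated text prefers the active (locale middleware) language;
    untranslated text uses the site default. Returns None when no
    pattern set matches, in which case no hyphenation should be done.
    """
    candidates = [default_language]
    if text_is_translated:
        candidates.insert(0, active_language)
    for code in candidates:
        if not code:
            continue
        code = code.lower()
        if code in available:
            return code
        # Fall back from a regional code such as "de-at" to "de".
        base = code.split("-")[0]
        if base in available:
            return base
    return None


print(pick_pattern_language(True, "de-at", "en-us"))
print(pick_pattern_language(False, "de-at", "en-us"))
```

Returning None rather than falling back to English is a deliberate choice in this sketch: English patterns applied to another language would produce wrong breaks.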


Malcolm Tredinnick

Jul 10, 2007, 11:12:10 PM
to django-d...@googlegroups.com
On Tue, 2007-07-10 at 18:26 +0000, doug.na...@gmail.com wrote:
>
> On Jul 9, 10:48 pm, "Jacob Kaplan-Moss" <jacob.kaplanm...@gmail.com>
> wrote:
> > Maybe an addition to django.contrib.humanize?
>
> If we decide to only support English, then I am fine with including
> this as part of django.contrib.humanize.
> If we decide to properly internationalize humanize, then I am fine
> with that as well. (you don't use commas in German, you use periods
> for instance).

Don't decide that this hinges on "fully internationalize humanize or it
shouldn't go there". Incremental changes are good.

> There are four reasons why I feel it is better to have this as part of
> the core:
> 1. Hyphenation is a media standard and crucial for non-html templates.
> Sites which want to generate printable PDF's of say conference
> programs, or in a standard news media style will want this as much as
> they want pluralize, widthratio, rjust, and center. This is more than
> a template filter, but is a text utility.

Not seeing why that does or doesn't support your argument. It's not
something you need all the time (more appropriate to print layout than
HTML, as a rule), so including it by default, given that HTML output is
the common case, isn't a requirement (and saves on memory usage when it's
not included, for example). Having it in contrib/ puts it exactly one
import away.

> 2. reduce duplication of code and confusion
> The actual code being duplicated is extremely minimal, but having two
> text wrappers in very different locations is confusing to both
> developers and users. For template filters, it would be better to have
> them documented together.
>
> 3. Internationalization
> To properly implement this we need to integrate with the
> internationalization code and have the core language developers help
> with maintaining the hyphenation rules. It does not feel DRY to have a
> separate internationalization system in humanize, and it does not seem
> right to have sections of the core only used by a contrib module
> (though this is done for admin).

This is based on a mistaken assumption, it looks like. Everything in
contrib is already supported by translators.

However, there's another consideration here, too: it's highly unlikely
that a normal translator will be able to maintain the hyphenation
databases. They are very technical data structures.

>
> In the end if those wiser than I decide it should be in humanize I
> have no problem changing the patch and writing up the doc and unit
> tests. I will need help with the internationalization parts. I do not
> have enough experience with the i18n system to make the proper
> architectural decisions.

I was thinking about this a bit yesterday. It shouldn't be too hard. I'm
a few days away from implementing anything, since I'm not going to
instantly bump this to the top of my list, but it's a solvable problem.

For my money, if we include this, putting it in contrib somewhere feels
better. It will also make maintenance easier, since we can give Ned (or
designated sock puppet) commit access to that part of the tree for
ongoing bug fixes.

Regards,
Malcolm

--
The early bird may get the worm, but the second mouse gets the cheese.
http://www.pointy-stick.com/blog/

doug.na...@gmail.com

Jul 11, 2007, 12:49:05 AM
to Django developers

On Jul 10, 11:12 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:


> Don't decide that this hinges on "fully internationalize humanize or it
> shouldn't go there". Incremental changes are good.

agreed.

>
> > There are four reasons why I feel it is better to have this as part of
> > the core:
> > 1. Hyphenation is a media standard and crucial for non-html templates.
> > Sites which want to generate printable PDF's of say conference
> > programs, or in a standard news media style will want this as much as
> > they want pluralize, widthratio, rjust, and center. This is more than
> > a template filter, but is a text utility.
>
> Not seeing why that does or doesn't support your argument. It's not
> something you need all the time (more appropriate to print layout than
> HTML, as a rule), so including it by default, given that HTML output is
> the common case, isn't a requirement (and saves on memory usage when its
> not included, for example). Having it in contrib/ puts it exactly one
> import away.

wordwrap, center, rjust, and widthratio (for most uses) are more
appropriate for print layout than HTML. The proper way to implement
these in HTML is with CSS, yet they are all part of the existing
default filters. When it comes to the templates, this is just a
specialized form of wordwrap. If the argument is that this is more for
printed forms and not of real general use for the most common html
generation, takes up memory, and adds bloat, then I question the
inclusion of these other filters, django.utils.text.wrap and other
utilities as well. At least that was my point (and admittedly a weak
one :-) The astute will notice that I left off my fourth argument (it
was just too weak).

> > I will need help with the internationalization parts. I do not
> > have enough experience with the i18n system to make the proper
> > architectural decisions.
>
> I was thinking about this a bit yesterday. It shouldn't be too hard. I'm
> a few days away from implementing anything, since I'm not going to
> instantly bump this to the top of my list, but it's a solvable problem.

I welcome any and all help!!! I don't see this as anything crucial or
time sensitive.
For my part I want this feature for a project, and at worst can
include the code as an app there. I just believe it should be part of
the django distribution instead of some third party addon.

-Doug

Christopher Lenz

Jul 13, 2007, 10:54:38 AM
to django-d...@googlegroups.com
Am 11.07.2007 um 07:49 schrieb doug.na...@gmail.com:
> I welcome any and all help!!! I don't see this as anything crucial or
> time sensitive.
> For my part I want this feature for a project, and at worst can
> include the code as an app there. I just believe it should be part of
> the django distribution instead of some third party addon.

Um, you do realize that you don't normally need any kind of
hyphenation in web applications?

Personally, while I love that Ned's coded this thing because I happen
to need it for one of my current projects, I don't see it as a need
common enough for inclusion in Django, or even a Django contrib thing.

If you want to build on Ned's work and extend it so that it's
properly internationalized and all that, just make it a separate
project. Most of the time, the projects where you may need
hyphenation aren't even web applications, so including the
functionality in Django would be rather inconvenient for those. And
doing this stuff as a separate project does not in any way preclude
using it in a Django app.

Just my 2 cents.

Cheers,
Chris
--
Christopher Lenz
cmlenz at gmx.de
http://www.cmlenz.net/

doug.na...@gmail.com

Jul 13, 2007, 4:58:31 PM
to Django developers

On Jul 13, 10:54 am, Christopher Lenz <cml...@gmx.de> wrote:
> Um, you do realize that you don't normally need any kind of
> hyphenation in web applications?

All depends on what you mean by 'web applications'.
This is definitely a good feature for mobile devices. Also there are
many filters currently in default filters which have no use whatsoever
for web services. wordwrap, center, and rjust have no purpose either.
I would argue that hyphenation is more useful than those, but again I
am biased.

At this point it looks like I will start it as a separate app, and
then if enough people find it useful, revisit the matter.
The problem is, as I have stated before, due to NDA/NC I cannot work
on the natural language or translation aspects of it, only
integration. The pay job wins.

-Doug

