Hi all,

is there a multi-language hyphenation solution for current ReportLab?
I found the "wordaxe" package, but it seems a little old and unmaintained... In
the current RL source there are just a few mentions of the capability, but I
fail to see how one could inject a language-specific hyphenator into the
Paragraph class.
Thanks in advance for any hint,
ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
le...@metapensiero.it | -- Fortunato Depero, 1929.
_______________________________________________
reportlab-users mailing list
reportl...@lists2.reportlab.com
https://pairlist2.pair.net/mailman/listinfo/reportlab-users
Hyphenation has always been rather low on the priority list, given that average word lengths in English are pretty short compared to German or French, say.
I've used Henning's wordaxe and its previous incarnations, but it's too long ago to give any useful advice now (also while still travelling back from vacation).
In any case that topic would be one more cool "pro" thing to be covered in Mike's emerging book, if he's still looking for topics. ;-)
Cheers,
Dinu
Lele Gaifax <le...@metapensiero.it>:
>
> Hi all,
>
> is there a multi-language hyphenation solution for current ReportLab?
>
> I found the "wordaxe" package, but seems a little old and unmaintained... In
> the current RL source there are just a few mentions to the capability, but I
> fail to see how one could inject a language-specific hyphenator into the
> Paragraph class.
>
> Thanks in advance for any hint,
> ciao, lele.
    if newWidth>maxWidth:
        nmw = min(lineno,maxlineno)
        if wordWidth>max(maxWidths[nmw:nmw+1]) and not isinstance(word,_SplitText) and splitLongWords:
            #a long word
            words[0:0] = _splitWord(word,maxWidth-spaceWidth-currentWidth,maxWidths,lineno,fontName,fontSize,self.encoding)
            self._splitLongWordCount += 1
            continue
so logically it ought to be possible to try to hyphenate before arbitrarily splitting. However, looking at the wordaxe code, it
seems to try to do a lot more than just allowing splits inside long words. I think the guts are in the function findBestSolution,
which is applied at the overflow, and in the code that follows immediately. I will have a go at getting the wordaxe stuff to work
inside latest reportlab when I get to rest after an upcoming surgery :(
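
The "hyphenate first, split arbitrarily only as a last resort" idea can be sketched without ReportLab at all. Everything below is an assumption made for illustration: BREAKS is a made-up table standing in for a real hyphenation dictionary, char_width is a toy metric where every character has the same width, and split_word is a hypothetical helper, not ReportLab's _splitWord.

```python
# Hypothetical break table standing in for a real hyphenation dictionary:
# each entry lists the indices at which a hyphen may be inserted.
BREAKS = {"hyphenation": [2, 6], "dictionary": [3, 6]}


def char_width(text, per_char=1.0):
    """Toy width metric: every character is per_char units wide."""
    return len(text) * per_char


def hyphen_pairs(word):
    """Yield (head, tail) candidates, longest head first."""
    for pos in sorted(BREAKS.get(word, ()), reverse=True):
        yield word[:pos], word[pos:]


def split_word(word, avail):
    """Return (head, tail) such that head fits into avail.

    A hyphenation point is preferred; if none fits (or none is known),
    the word is cut at an arbitrary character instead.
    """
    for head, tail in hyphen_pairs(word):
        if char_width(head + "-") <= avail:
            return head + "-", tail
    # Fallback: arbitrary split, keeping at least one character in the head.
    n = max(1, int(avail))
    return word[:n], word[n:]
```

With this table, split_word("hyphenation", 8) picks the break after "hyphen", while a word missing from the table is simply cut at the available width.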
--
Robin Becker
On 23/04/2018 15:47, Mike Driscoll wrote:
> Hi Dinu and Lele,
>
> I haven't actually used wordaxe before. I had thought I had seen something
> on the mailing list about ReportLab having better support for this
> built-in, but I can't find anything obvious with Google. Dinu, did
> ReportLab ever include your changes that you mentioned in this thread:
> https://pairlist2.pair.net/pipermail/reportlab-users/2008-September/007283.html
> ?
>
> If I had some samples of what it looks like before wordaxe is applied and
> then after, I think I would understand it a bit better. I have a general
> idea, but I prefer concrete examples. Then maybe I could put together a
> tutorial around the subject. It is too bad that wordaxe looks like it's been
> dead for 8 years.
>
> Here's a link to the book Dinu mentioned: https://leanpub.com/reportlab/ I
> still have around 4 chapters left to write, but you can get what's
> currently done if you're interested (plus free updates)
>
> Mike
>
> -----------------
> Mike Driscoll
>
> Blog: http://blog.pythonlibrary.org
> Books: Python 101 <https://gum.co/py101>, Python 201: Intermediate Python
> <https://gum.co/py201>, wxPython Recipes
> <https://www.apress.com/us/book/9781484232361>, Python Interviews
> <https://www.packtpub.com/web-development/python-interviews>
my “alternative paragraphs” didn’t go anywhere after I put them online. As I said, hyphenation was always very low priority for ReportLab, and it was always hard to find a larger audience who would see the value behind it. Even today it’s not considered important on the web.
I was browsing a few hours ago on “python hyphenation” and found some stuff I was not aware of, like http://pyphen.org. Much of this seems to build on language dictionaries in OpenOffice/LibreOffice, like wordaxe. Unfortunately, I could not convince Henning to package his wordaxe much earlier using distutils, and even now it only shows a one-line description on PyPI:
https://pypi.org/project/wordaxe/#description
I’ve added him on CC to this message as I’m not sure he’s still on this list.
I’ve dug out a sample project of mine from 2008 that uses wordaxe with German text. The PDFs are ca. 250 KB and too large for this list (which has some low limit like 40 KB or so). I’m happy to send it all to you via email (if it’s still working), not sure it’s still working.
Concerning your book: I’ve backed it on Kickstarter already (some greater package), so feel free to point me to any sneak preview. I just don’t have the time to provide extended, substantial feedback.
Cheers,
Dinu
Ok, the “(if it’s still working)” was meant to refer to Henning’s old email address…
Cheers,
Dinu
> I was browsing a few hours ago on “python hyphenation” and found some stuff
> I was not aware of, like http://pyphen.org.
Thank you Dinu,
pyphen API is so straightforward that I could not resist trying to inject it
in the process, so I spent an hour this morning and I wrote a quick&dirty
hack, that is already able to handle the simplest case.
I wrote a PyphenParagraph class that accepts a "hyphenator" instance in its
constructor, overriding the "breakLines()" method and extending the "split()"
method. In "breakLines()", whenever it meets a word that does not fit in the
available space it calls a new "hyphenateWord()" method that may return a
(headWord, tailWord) pair on success, that it pushes back in the "words" list.
Basically:
    class PyphenParagraph(Paragraph):
        def __init__(self, *args, hyphenator=None, **kwargs):
            self.hyphenator = hyphenator
            super().__init__(*args, **kwargs)

        def split(self, availWidth, availHeight):
            # Propagate the hyphenator to the split paragraphs: parent's split() uses
            # "self.__class__(foo, bar, spam=eggs)" to create them...
            parts = super().split(availWidth, availHeight)
            # Loop instead of indexing, so a result with fewer than two items is safe
            for part in parts:
                part.hyphenator = self.hyphenator
            return parts

        def hyphenateWord(self, word, availWidth, fontName, fontSize):
            # Return the first (head, tail) candidate whose head fits, else None
            for head, tail in self.hyphenator.iterate(word):
                head += '-'
                width = stringWidth(head, fontName, fontSize, self.encoding)
                if width <= availWidth:
                    return _SplitText(head), tail
            return None

        def breakLines(self, width):
            ... # untouched code up to
            while words:
                word = words.pop(0)
                #this underscores my feeling that Unicode throughout would be easier!
                wordWidth = stringWidth(word, fontName, fontSize, self.encoding)
                newWidth = currentWidth + spaceWidth + wordWidth
                if newWidth>maxWidth:
                    if self.hyphenator is not None and not isinstance(word, _SplitText):
                        pair = self.hyphenateWord(word, maxWidth - spaceWidth - currentWidth,
                                                  fontName, fontSize)
                        if pair is not None:
                            words[0:0] = pair
                            continue
            ... # untouched code till the end
However, I must be missing something in the "width" argument, because for
example when using an ImageAndFlowables it clearly uses the wrong width in the
"second" part (where the image ends, so there's a wider space available)...
Anyway, before going any further in my experiments, I would like to know if I
am on a good track or not, to avoid wasting energy :-)
Here is my script: https://gist.github.com/lelit/9c1cba52fd6dd9f1123fe82ce4b788db
It obviously requires a "pip install pyphen" and a copy of RL's
tests/pythonpowered.gif: executing it you will get a simple document with two
paragraphs, the first with an image in its top left corner and a second plain
paragraph. The latter is correct, while in the former you can spot that a "bogus"
hyphenation is happening in the "Les-ser GPL" line...
> However, I must be missing something in the "width" argument, because for
> example when using a ImageAndFlowables it clearly uses the wrong width in the
> "second" part (where the image ends so there's a wider space available)...
I think I figured it out and spotted the problem with ImageAndFlowables: the
widget first calls .wrap() on the paragraphs at the image's side, which in turn
computes their blPara attribute. Then, if the flowables are taller than the
image, it splits the paragraphs, and that's where it does the wrong thing (for
my purpose, of course): the Paragraph.split() method cuts the paragraph in two
halves, rebuilding their respective frags attribute from blPara while honoring
the height constraint. Words that were already hyphenated thus end up in the
ParaFrag structure and are eventually rendered as-is in a possibly wider
context.
Fixing the issue is complicated, because the Paragraph.split() method calls
plain functions (_split_blParaSimple() and _split_blParaHard()) so I should
override the whole method simply to call a "smarter" implementation that knows
about _SplitText and _SplitList...
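
One way to picture what such a "smarter" split would have to do is to undo earlier hyphenations before the text is wrapped again at a different width. The sketch below is only an assumption about the general shape of that step: SplitHead is a hypothetical marker class playing the role of ReportLab's private _SplitText, and rejoin is an invented helper that works on a plain list of words rather than on ParaFrag structures.

```python
class SplitHead(str):
    """Marks the first part of a word that was hyphenated at a line end.

    Hypothetical stand-in for a marker such as ReportLab's _SplitText.
    """


def rejoin(words):
    """Merge every SplitHead with the word that follows it.

    The trailing hyphen is dropped because it was added by the
    hyphenation step and is not part of the original word.
    """
    out = []
    prefix = ""
    for w in words:
        if isinstance(w, SplitHead):
            # Accumulate heads; a word may have been split more than once.
            prefix += w[:-1] if w.endswith("-") else w
        else:
            out.append(prefix + w)
            prefix = ""
    if prefix:
        # A dangling head with no tail is kept rather than lost.
        out.append(prefix)
    return out
```

After rejoining, the word list can go through line breaking again at the new width, so a fragment like "Les-" plus "ser" is rendered as "Lesser" when it now fits.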
At this point, given that my need for ImageAndFlowables is marginal at best, I
won't be feeding it PyphenParagraphs, using them only in "plain" containers.
I updated my gist to what I'm currently using, previous version had a glitch
with multifrag paragraphs.
Thanks in advance for any further insight.
> I have mostly got hyphenation working using Pyphen. Currently about 5 tests fail because of
> simple paragraph corner cases involving splits.
>
> I will try and finish a working version next week.
Great news, thank you!
> Unfortunately my simple approach is also trying to hyphenate things like
> URLs, which I suppose should be handled separately.
>
> Also currently I lack a way to get just the word and not non-alphabetics. I
> suppose that should be easy if we know what constitutes hyphenatable matter.
> Any ideas welcome.
IMHO I would leave such a decision to the final user, as I do not think there is
one single *right* answer... Even for URLs, which I actually happen to manage
in the app where I experimented with this matter, I could not reach a consensus on
what should happen, I mean between what currently happens:
|----------------------------------------------|
|This is a long URL: https://hostname/contentna|
|me/whatelse
and with pyphen:
|----------------------------------------------|
|This is a long URL: https://hostname/content- |
|name/whatelse
it's obviously debatable...
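
Both points (leaving the URL decision to the user, and isolating the alphabetic part of a word) can be sketched with the standard library only. The names below are hypothetical and the URL test is a deliberately crude assumption; neither is how ReportLab or pyphen decides this.

```python
import re

# Leading punctuation, a run of letters, trailing punctuation.
# [^\W\d_] matches any Unicode letter (word characters minus digits and "_").
_CORE = re.compile(r"^(\W*)([^\W\d_]+)(\W*)$")


def looks_like_url(word):
    """Crude heuristic, an assumption made for this sketch only."""
    return "://" in word or word.startswith("www.")


def should_hyphenate(word, allow_urls=False):
    """Let the caller decide whether URL-like words may be hyphenated."""
    if not allow_urls and looks_like_url(word):
        return False
    return True


def split_core(word):
    """Return (prefix, core, suffix), or None if there is no clean core.

    Only the core would be handed to the hyphenator; prefix and suffix
    are glued back on afterwards.
    """
    m = _CORE.match(word)
    return m.groups() if m else None
```

A word such as '"hyphenation",' yields its quotes and comma as prefix and suffix, while something like "x86" has no purely alphabetic core and is left alone.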
> I am presently running with the idea of using a string setting to indicate
> what language so my settings override has
>
> hyphenationLang='en_GB'
>
> which corresponds to one of the pyphen dictionaries. This gets into the
> style and is used only if pyphen can be imported.
Not sure what you mean here, but just to make my use case clear: the app I'm
currently developing is multilingual, and produces several PDFs for a given
"item", one for each language the customer decided to support. So having a
"static" setting for the target language would not work very well for me...
best would be having a way to pass the hyphenator to the Paragraph
constructor, possibly taking a default from the SimpleDocument...
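
For the multilingual case, one possible arrangement is a small per-language cache that also copes with pyphen being absent. This is a sketch under assumptions: get_hyphenator is a hypothetical helper, and only pyphen.Pyphen(lang=...) and the pyphen.LANGUAGES mapping are taken from pyphen itself.

```python
from functools import lru_cache

try:
    import pyphen  # optional dependency: "pip install pyphen"
except ImportError:
    pyphen = None


@lru_cache(maxsize=None)
def get_hyphenator(lang):
    """Return a cached pyphen.Pyphen for lang, or None if unavailable.

    None is returned both when pyphen is not installed and when it
    ships no dictionary under that language name.
    """
    if pyphen is None or lang not in pyphen.LANGUAGES:
        return None
    return pyphen.Pyphen(lang=lang)
```

Each PDF could then ask for get_hyphenator("it_IT"), get_hyphenator("en_GB") and so on, and pass the result (or None, meaning no hyphenation) to whatever Paragraph subclass or style is in use.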
I will surely try out your solution as soon as it hits the repository and will
report back.
Thanks again,
ciao, lele.
I will be surely checking this as soon as possible and will report back.
Robin Becker <ro...@reportlab.com> writes:
> I have checked in a branch 'hyphenation' with a working approach to
> hyphenation.
I added some comments to the commit.
ciao, lele.
> The hyphenator is not being propagated to any split sub-paragraphs. I
> think that we don't want to go down the route of supporting large numbers
> of optional attributes on the paragraph instances; that is what the style is
> for. I tried this and it seems to work.
Oh, right, I forgot about that, thank you! I will keep using a custom
Paragraph subclass.
ciao, lele.
> probably I will drop the support for an explicit attribute. If that's done
> then the easy way for you to subclass Paragraph would be something like
>
> from reportlab.platypus import Paragraph
> class LeleParagraph(Paragraph):
>     def __init__(self, text, style, **kwds):
>         hyphenator = kwds.pop('hyphenator',kwds.pop('hyphenationLang',None))
>         if hyphenator:
>             #override the default
>             style = style.clone(style.name+'-hyphenated', hyphenationLang=hyphenator)
>         Paragraph.__init__(self,text,style, **kwds)
Ok, no problem at all: I actually tried both approaches (that is, having a
custom Paragraph and tweaking the style), and I could not decide which is
better, so your decision definitely helps ;-)
Can you tell if this is going to land in the upcoming (July?) official
release?
Thanks for your support,
ciao, lele.
Hyphenation is a really hard problem. Huge kudos to Robin for making
this work.
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.