
expl3 question: string processing


Pandita

Sep 21, 2010, 8:17:11 AM
hi! all,

I have just been discussing a feature request for csquotes with Dr.
Lehman and he said, "Any word count requires string processing and
TeX is not very good at that." Since expl3 is the core language of the
forthcoming LaTeX 3, is there any work afoot on easy and robust tools
for string processing tasks such as counting words? Can we expect good
news from these quarters?

cheers,

Ven. Pandita

Robin Fairbairns

Sep 21, 2010, 10:47:23 AM
Pandita <venpa...@yahoo.com> writes:

unfortunately, what philipp l. said is true. it's also compounded by
the fact that there are 4 different engines expl3 might be running on
(at least, in today's view of the world): etex, pdfetex, xetex, luatex.

for example pdf(e)tex does provide \pdfstrcmp (compare strings), and
there's supposedly a version of that primitive in xetex, but in luatex
it will probably need to be implemented in a lua callback. (there's
been a discussion about this issue on the project list just recently.)

if latex 3 had the luxury of writing all code for just one engine, life
would be a lot easier, but it would *still* require extensions (such as
\pdfstrcmp).

so, as presently envisaged, i don't think there's much chance of extra
text processing support. if, however, you're using luatex, the game
(potentially) changes -- but it would change just the same if you were
using latex 2e, since the change comes from the engine you're using.
--
Robin Fairbairns, Cambridge

Heiko Oberdiek

Sep 21, 2010, 3:41:28 PM
Robin Fairbairns <rf...@sxp10.cl.cam.ac.uk> wrote:

> for example pdf(e)tex does provide \pdfstrcmp (compare strings), and
> there's supposedly a version of that primitive in xetex, but in luatex
> it will probably need to be implemented in a lua callback. (there's
> been a discussion about this issue on the project list just recently.)

Package `pdftexcmds' adds many of the missing pdfTeX primitives for
LuaTeX, see \pdf@strcmp.
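
A minimal sketch of how that might be used, assuming only that pdftexcmds is installed. The helper name \ifsamestring is made up for this example and is not part of any package; \pdf@strcmp itself expands to -1, 0 or 1.

```latex
\documentclass{article}
\usepackage{pdftexcmds}
\makeatletter
% \ifsamestring is a hypothetical helper name for this sketch.
% \pdf@strcmp{<a>}{<b>} expands to -1, 0 or 1 on pdfTeX, XeTeX and LuaTeX.
\newcommand*{\ifsamestring}[2]{%
  \ifnum\pdf@strcmp{#1}{#2}=0 %
    \expandafter\@firstoftwo
  \else
    \expandafter\@secondoftwo
  \fi}
\makeatother
\begin{document}
\ifsamestring{abc}{abc}{equal}{different}
\end{document}
```

The comparison is expandable, so a test like this can also be used inside \edef or \write.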

--
Heiko Oberdiek

Joseph Wright

Sep 21, 2010, 3:49:21 PM

Indeed, expl3 loads pdftexcmds when used with LuaTeX (as a package),
rather than reinvent the wheel, so that \pdfstrcmp is available. The
same code is also built into the format version of the code. (XeTeX does
have \strcmp: I know because I did a lot of the leg work to find the
relevant lines in pdfTeX and send them to Jonathan Kew for adding to XeTeX!)
--
Joseph Wright

Joseph Wright

Sep 21, 2010, 4:04:05 PM
On 21/09/2010 15:47, Robin Fairbairns wrote:
> Pandita<venpa...@yahoo.com> writes:
>
>> I have just been discussing a feature request for csquotes with Dr.
>> Lehman and he said, "Any word count
>> requires string processing and TeX is not very good at that." Since
>> expl3 is the core language of the forthcoming LaTeX 3, is there any
>> work afoot on easy and robust tools for string processing tasks such
>> as counting words? Can we expect good news from these quarters?
>
> unfortunately, what philipp l. said is true.

I'd also agree here. However, I'm intrigued as to your particular
requirements. Can you give an example of what you'd like?

> for example pdf(e)tex does provide \pdfstrcmp (compare strings), and
> there's supposedly a version of that primitive in xetex, but in luatex
> it will probably need to be implemented in a lua callback. (there's
> been a discussion about this issue on the project list just recently.)
>
> if latex 3 had the luxury of writing all code for just one engine, life
> would be a lot easier, but it would *still* require extensions (such as
> \pdfstrcmp).

There has been some very recent discussion on exactly this: I've just
posted an e-mail to the LaTeX-L list about \pdfstrcmp. The result is
that for a number of reasons LaTeX3/expl3 does now require an engine
which provides this (pdfTeX 1.30+, XeTeX 0.9994+, LuaTeX 0.40+). That
may make life a little easier, but until someone (um, me I guess) writes
a string module we won't know.
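
For readers on a current expl3, here is a rough sketch of a naive word count. It assumes the l3seq functions \seq_set_split:Nnn and \seq_count:N from present-day releases; the names \naivewordcount and \l__demo_words_seq are invented for the example. It splits at spaces only, so macros, math and footnotes in the text are not handled in any sensible way.

```latex
\documentclass{article}
\usepackage{expl3,xparse}% both are part of the kernel in recent LaTeX
\ExplSyntaxOn
% hypothetical variable and command names, for illustration only
\seq_new:N \l__demo_words_seq
\NewDocumentCommand \naivewordcount { m }
  {
    % in expl3 syntax "~" stands for an ordinary space token
    \seq_set_split:Nnn \l__demo_words_seq { ~ } {#1}
    \seq_count:N \l__demo_words_seq
  }
\ExplSyntaxOff
\begin{document}
\naivewordcount{This is a quote.}
\end{document}
```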
--
Joseph Wright

Ulrich D i e z

Sep 22, 2010, 9:28:39 AM
Pandita wrote:

> is there any
> work afoot on easy and robust tools for string processing tasks such
> as counting words?

It is not clear to me what you mean by "tools for...tasks such as..."
Does this refer to routines which in a "superordinate-context"
accomplish the entire task (e.g., of counting words) completely,
or does this refer to a set of (sub-)routines which facilitate the
task of writing what oneself considers a "sufficient routine for
the current case" ?

I think automating word counting might turn out to be a
not-so-easy task.

I think that, among other aspects, automatic word counting
requires algorithms for detecting
- which sequences of bits and bytes belong to a single
  writeable/drawable "expression for a concept".
  (Here a question might be what to consider a single writeable/
  drawable "expression for a concept", e.g., how to treat
  hyphenated compound words and the like?)
- whether sequences of bits and bytes represent words at all or
  whether these sequences represent other kinds of writeable/
  drawable "expressions for a concept" [e.g., acronyms or
  abbreviations; Chinese writing; visible algebraic signs;
  smileys ;-) ; other ways of combining visible/writeable/drawable
  signs for symbolizing concepts].

Where can I learn more about such algorithms?

Did somebody put into words what is to be considered a "word" in
the context of "word-counting"?

Ulrich

Pandita

Sep 22, 2010, 1:37:50 PM
On Sep 22, 1:04 am, Joseph Wright <joseph.wri...@morningstar2.co.uk> wrote:
> On 21/09/2010 15:47, Robin Fairbairns wrote:
> > Pandita<venpand...@yahoo.com> writes:
> >> I have just been discussing a feature request for csquotes with Dr.
> >> Lehman and he said, "Any word count requires string processing and
> >> TeX is not very good at that."
> >
> > unfortunately, what philipp l. said is true.
>
> I'd also agree here. However, I'm intrigued as to your particular
> requirements. Can you give an example of what you'd like.

csquotes has some "blockquote" commands which format quotations
either inline or as a separate block, depending on a user-defined
"block threshold", which is a certain number of lines. The default is
3 lines. Thus with the default choice, if the quote is more than 3
lines, it will become a separate block, while it will be set inline if
less than 3. My request is for a package option that expresses the
block threshold as the number of words in the quote rather than the
actual number of lines. So what we need here is a routine to count the
words in a given quote.
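
For reference, the existing line-based behaviour looks roughly like this. It is a minimal sketch using the csquotes commands \SetBlockThreshold and \blockquote; the sample text is made up.

```latex
\documentclass{article}
\usepackage{csquotes}
% threshold in lines; 3 is the default, set explicitly here
\SetBlockThreshold{3}
\begin{document}
Some text, then \blockquote{a short quotation that fits within the
threshold and is therefore typeset inline} and more text.
\end{document}
```

A word-based threshold would make the same inline/block decision from a word count instead of from the number of typeset lines.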

Not that I can really make use of a string module even if it is
available, since I am only an ordinary user. I only hope it would make
life easier for package authors like Dr. Lehman.

Ven. Pandita

Philipp Lehman

Sep 22, 2010, 3:41:39 PM
Pandita wrote:

> Not that I can really make use of a string module even if it is
> available since I am only an ordinary user. I only hope it would
> make the life easier for package authors like Dr. Lehman.

I'm not going to write the beast but I can provide you or anyone
interested in tackling the problem with a basic starting point. If I
were to implement that, I'd go for a two-pass parser: drop any
material to be excluded from word counting on the first pass and count
words on the second. I'm attaching a quickly hacked-up proof of
concept of the second step below. It's *very* basic.

Both steps are tricky (the first one even more so). You need to handle
things like the following by dropping anything that is not a "word":

\wordcountquote{word $a^2 + b^2 = c^2$ word}
\wordcountquote{word\footnote{note note} word}

Lots of scenarios here...

You also need to define what a "word" is (in a technical sense). My
code below assumes that a word is any arbitrary string between spaces,
where a "space" may be "~" or any number of literal space tokens.

Since you inquired about higher-level string processing facilities, I
think the only thing which would be of any use are substring matches
based on regular expressions, especially when preprocessing the
string. I don't know Lua, but from a quick glance at the reference
manual I'd say that "string.find" might be useful. It does seem to
support at least basic regular expressions (I didn't find anything
about advanced regex matching with positive/negative look-ahead, but I
may have missed that).

So here's some code. There are a thousand ways for it to fail.

\documentclass{article}
\usepackage{csquotes}
\makeatletter

% word counting

\newcounter{wordthreshold}
\newif\ifwordcountdebug
\newcount\lq@wordcount

% syntax: \iflongquote{<text>}{<true>}{<false>}
%
% true: word count > \value{wordthreshold}
% false: word count <= \value{wordthreshold}

\newcommand*{\iflongquote}{%
\begingroup
\lq@wordcount\z@
\lq@preprocess}

\long\def\lq@preprocess#1{%
\def\lq@text{#1}%
% to do: perform preprocessing of #1 here,
% drop anything to be excluded from
% word counting; return result in \lq@text
\expandafter\lq@parse\lq@text\@end}

\def\lq@parse{%
\futurelet\@let@token\lq@parse@i}

\long\def\lq@parse@i{%
\ifx\@end\@let@token
\advance\lq@wordcount\@ne
\lq@goto\lq@end
\fi
\ifx\@sptoken\@let@token
\advance\lq@wordcount\@ne
\lq@goto\lq@space
\fi
\ifx~\@let@token
\advance\lq@wordcount\@ne
\lq@goto\lq@next
\fi
\ifx\bgroup\@let@token
\lq@goto\lq@group
\fi
\if\relax\noexpand\@let@token
\lq@goto\lq@next
\fi
% to do: add more tests, be smarter
\iftrue
\lq@goto\lq@next
\fi
\@goto}

\def\lq@goto#1#2\@goto{\fi#1}
\long\def\lq@space#1{\lq@parse#1}
\long\def\lq@next#1{\lq@parse}
\long\def\lq@group#1{\lq@parse#1}

\def\lq@end\@end{%
\ifwordcountdebug
[\number\lq@wordcount\space words]\space
\fi
\expandafter\endgroup
\ifnum\lq@wordcount>\c@wordthreshold
\expandafter\@firstoftwo
\else
\expandafter\@secondoftwo
\fi}

% csquotes interface

\newcommand*{\wordcountquote}{%
\@getoptargs\@wordcountquote}

\long\def\@wordcountquote#1#2#3{%
\iflongquote{#3}
{\begin{displayquote}[#1][#2]#3\end{displayquote}}
{\textquote[#1][#2]{#3}}}

\def\@getoptargs#1{%
\@ifnextchar[%]
{\@getoptargs@i{#1}}
{\@getoptargs@i{#1}[]}}
\long\def\@getoptargs@i#1[#2]{%
\@ifnextchar[%]
{\@getoptargs@ii{#1}{#2}}
{\@getoptargs@ii{#1}{#2}[]}}
\long\def\@getoptargs@ii#1#2[#3]{#1{#2}{#3}}

\makeatother
\setlength\parindent{0pt}
\begin{document}

\setcounter{wordthreshold}{10}

\section*{Testing}

\wordcountdebugtrue

\iflongquote{This is a quote.}{block}{inline}

\iflongquote{This is a~quote.}{block}{inline}

\iflongquote{This is a \emph{short quote}.}{block}{inline}

\iflongquote{This is a~\emph{short~quote}.}{block}{inline}

\iflongquote{This is a quote. With a low threshold, it will be typeset
as a separate paragraph.}{block}{inline}

\iflongquote{This is a~quote. With a~low threshold, it will be typeset
as~a separate paragraph.}{block}{inline}

\iflongquote{This is a somewhat \emph{longer quote}. With a low
threshold, it will be typeset as a separate paragraph.}{block}{inline}

\iflongquote{This is~a somewhat \emph{longer quote}. With~a low
threshold, it will~be typeset as~a separate paragraph.}{block}{inline}

\wordcountdebugfalse

\section*{Quoting}

\wordcountquote[citation]{This is a quote.}

\wordcountquote[citation][.]{This is a~\emph{short~quote}}

\wordcountquote[citation]{This is a~quote. With a~low threshold, it
will be typeset as~a separate paragraph.}

\wordcountquote[citation][.]{This is~a somewhat \emph{longer quote}.
With~a low threshold, it will~be typeset as~a separate paragraph}

\end{document}

--
Sender address blackholed, do not reply directly.
You can still reach me by email at: lehman gmx net.

Will Robertson

Sep 22, 2010, 9:57:55 PM
[Diverging from the thread yet again.]

On 2010-09-23 05:11:39 +0930, Philipp Lehman
<devnull....@spamgourmet.com> said:

> I don't know Lua, but from a quick glance at the reference
> manual I'd say that "string.find" might be useful. It does seem to
> support at least basic regular expressions (I didn't find anything
> about advanced regex matching with positive/negative look-ahead, but I
> may have missed that).

My copy of Programming in Lua says:

Unlike several other scripting languages, Lua uses neither POSIX
(regexp) nor Perl regular expressions for pattern matching. The main
reason for this decision is size: a typical implementation of POSIX
regular expressions takes more than 4000 lines of code. This is about
the size of all Lua standard libraries together. In comparison, the
implementation of pattern matching in Lua has less than 500 lines. Of
course, the pattern matching in Lua cannot do all that a full POSIX
implementation does. Nevertheless, pattern matching in Lua is a
powerful tool, and includes some features that are difficult to match
with standard POSIX implementations.

So "regex" as you know it isn't available, but it's not the end of the
world. No doubt someone's written a POSIX library for Lua if you happen
to need it. Some more discussion here:

http://stackoverflow.com/questions/2693334/lua-pattern-matching-vs-regular-expressions
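
To make that concrete, here is a small LuaLaTeX sketch that counts space-separated chunks with a plain Lua pattern. The macro name \luawordcount is invented for the example. The pattern is written as "[^ ]+" rather than with a Lua character class like %S, because % is a TeX comment character inside \directlua.

```latex
% Compile with lualatex.
\documentclass{article}
% \luawordcount is a hypothetical name used only for this sketch.
\newcommand{\luawordcount}[1]{%
  \directlua{
    local text = "\luaescapestring{\unexpanded{#1}}"
    local n = 0
    for _ in string.gmatch(text, "[^ ]+") do n = n + 1 end
    tex.sprint(tostring(n))
  }}
\begin{document}
\luawordcount{This is a quote.}
\end{document}
```

The text is handed to Lua unexpanded, so a macro such as \emph arrives as literal characters and is counted like any other chunk; filtering those out is the hard part.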

Cheers,
Will


Thomas A. Schmitz

Sep 23, 2010, 1:54:47 AM
Will Robertson <wsp...@gmail.com> writes:

Diverging even further from the diversion: the luatex binary has
lua's lpeg (http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html)
baked in. It's a powerful beast and can do more than any
regex-based engine, or so I hear - it's too complex for my little
non-math, non-programming understanding.
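
For the curious, a tentative sketch of the same naive count written with lpeg, assuming the luacode package is available to hold the Lua chunk verbatim. The names count_words_lpeg and \lpegwordcount are invented for the example. The grammar says: a blank is one or more whitespace characters, a word is one or more non-blank characters, and every word is captured into a table whose length is the count.

```latex
% Compile with lualatex.
\documentclass{article}
\usepackage{luacode}
% count_words_lpeg and \lpegwordcount are hypothetical names.
\begin{luacode*}
local blank = lpeg.S(" \t\n")^1
local word  = (1 - lpeg.S(" \t\n"))^1
local words = lpeg.Ct(blank^-1 * (lpeg.C(word) * blank^-1)^0)
function count_words_lpeg(text)
  return #words:match(text)
end
\end{luacode*}
\newcommand{\lpegwordcount}[1]{%
  \directlua{tex.sprint(tostring(count_words_lpeg(
    "\luaescapestring{\unexpanded{#1}}")))}}
\begin{document}
\lpegwordcount{This is a quote.}
\end{document}
```

For a case this simple lpeg buys nothing over a plain pattern; it would start to pay off once the grammar has to skip balanced braces or math.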

Best

Thomas

ashinpan

Sep 23, 2010, 10:46:12 AM
On Sep 23, 12:41 am, Philipp Lehman <devnull.1.leh...@spamgourmet.com>
wrote:

> I'm not going to write the beast but I can provide you or anyone
> interested in tackling the problem with a basic starting point. If I
> were to implement that, I'd go for a two-pass parser: drop any
> material to be excluded from word counting on the first pass and count
> words on the second. I'm attaching a quickly hacked-up proof of
> concept of the second step below. It's *very* basic.

Thanks a lot for the code. I will try to use it, and if anything goes
wrong, I will try to correct it as much as I can. Only if I am
entirely lost will I post a question here.

That said, I still hope that someone will implement a non-TeX module
for such tasks. Perhaps as a luatex package since luatex is supposed
to merge with pdftex sometime in the future.

cheers,

Ven. Pandita

Philipp Lehman

Sep 23, 2010, 12:04:32 PM
> That said, I still hope that someone will implement a non-TeX module
> for such tasks. Perhaps as a luatex package since luatex is supposed
> to merge with pdftex sometime in the future.

Just to be clear: a Lua-based parser would probably be more compact
and you could handle a larger number of cases more easily, but it
still boils down to checking for a limited number of substrings which
are to be excluded from the word count. The tricky part is not the
substrings you count but the substrings you don't count. A parser
using pattern matching would be more powerful but it still wouldn't
be an all-purpose solution. Since you're parsing the strings on the
macro level, the parser will only work in a limited number of cases.
If you add more cases, the code will eventually turn into a
monstrosity, no matter if it's implemented in TeX or in Lua.

Having said that, the parser may be good enough when used in a
controlled environment. I'm attaching an improved version below which
performs counting and skipping in a single pass. It ignores inline
math and can be configured to ignore (certain) control sequences. Note
that it doesn't count words, it counts word separators (space tokens,
"~", "\ ", "\space") and adds 1 to the result. Control sequences are
skipped, but unbreakable spaces, control spaces, and so on after a
control sequence count as a word separator. Arguments to macros are
counted as well. E.g.:

"word \foobar word" -> 2 words
"word \foobar~word" -> 3 words
"word \foobar\ word" -> 3 words
"word \emph{emph emph} word" -> 4 words
"word \emph {emph emph} word" -> 4 words

It can be configured to ignore macros with one mandatory or with one
optional plus one mandatory argument (i.e., both the macro and the
arguments). It's preconfigured to ignore \footnote and the footnote
text, e.g.:

"word\footnote{note note note} word" -> 2 words

You can add more exceptions which satisfy these syntactical
requirements by appending them to the \ignorecommands list. For
example, \index macros and endnotes would be added as follows:

\appto\ignorecommands{%
\index
\endnote
\endnotemark
}

Here's the second iteration:

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{csquotes}
\makeatletter

% word counting

\newcounter{wordthreshold}
\newif\ifwordcountdebug
\newcount\lq@wordcount

% syntax: \iflongquote{<text>}{<true>}{<false>}
%
% true: word count > \value{wordthreshold}
% false: word count <= \value{wordthreshold}

\newcommand{\iflongquote}[1]{%
\begingroup
\lq@wordcount\z@
\lq@parse#1\lg@endhere}

\def\lq@parse{%
\futurelet\@let@token\lq@parse@i}

\long\def\lq@parse@i{%
\ifx\lg@endhere\@let@token
\advance\lq@wordcount\@ne
\lq@goto\lq@end
\fi
\ifx\@sptoken\@let@token
\advance\lq@wordcount\@ne
\lq@goto\lq@space
\fi
\ifx~\@let@token
\advance\lq@wordcount\@ne
\lq@goto\lq@next
\fi
\ifx\ \@let@token
\advance\lq@wordcount\@ne
\lq@goto\lq@next
\fi
\ifx\space\@let@token
\advance\lq@wordcount\@ne
\lq@goto\lq@next
\fi
\ifx\bgroup\@let@token
\lq@goto\lq@group
\fi
\ifx$\@let@token
\lq@goto\lq@math
\fi
\if\relax\noexpand\@let@token
\lq@goto\lq@parse@cs
\fi
\expandafter\lq@next\@gobble
\lq@gohere}

\def\lq@parse@cs{%
\expandafter\lq@parse@cs@i\ignorecommands\lq@stophere}

\def\lq@parse@cs@i#1{%
\ifx#1\lq@stophere
\lq@goto\lq@next
\fi
\ifx#1\@let@token
\lq@goto{\lq@stop\lq@cmd}%
\fi
\expandafter\lq@parse@cs@i\@gobble
\lq@gohere}

\def\lq@goto#1#2\lq@gohere{\fi#1}
\def\lq@stop#1#2\lq@stophere{#1}
\def\lg@endhere{\lg@endhere}
\def\lq@stophere{\lq@stophere}

\long\def\lq@space#1{\lq@parse#1}
\long\def\lq@next#1{\lq@parse}
\long\def\lq@group#1{\lq@parse#1}

\long\def\lq@math$#1${\lq@space}

\def\lq@cmd#1{\lq@cmd@i}
\newcommand{\lq@cmd@i}[2][]{\lq@parse}
\newcommand*{\ignorecommands}{}

\def\lq@end\lg@endhere{%
\ifwordcountdebug
[\number\lq@wordcount\space words]\space
\fi
\expandafter\endgroup
\ifnum\lq@wordcount>\c@wordthreshold
\expandafter\@firstoftwo
\else
\expandafter\@secondoftwo
\fi}

% csquotes interface

\newcommand*{\longquote}{%
\lq@getoptargs\lq@longquote}

\long\def\lq@longquote#1#2#3{%
\iflongquote{#3}
{\begin{displayquote}[#1][#2]%
#3%
\end{displayquote}}
{\textquote[#1][#2]{#3}}}

\def\lq@getoptargs#1{%
\@ifnextchar[%]
{\lq@getoptargs@i{#1}}
{\lq@getoptargs@i{#1}[]}}
\long\def\lq@getoptargs@i#1[#2]{%
\@ifnextchar[%]
{\lq@getoptargs@ii{#1}{#2}}
{\lq@getoptargs@ii{#1}{#2}[]}}
\long\def\lq@getoptargs@ii#1#2[#3]{#1{#2}{#3}}

\makeatother

% configuration

\setcounter{wordthreshold}{10}

\appto\ignorecommands{%
\footnote
\footnotemark
}

% Let's go


\setlength\parindent{0pt}
\begin{document}

\section*{Testing}

\wordcountdebugtrue

\iflongquote{Some word and some math: $ a^2 + b^2 = c^2 $ and another
word.}{>\number\value{wordthreshold}}{<=\number\value{wordthreshold}}

\iflongquote{Some control-sequence: \LaTeX\ and a word makes 8}
{>\number\value{wordthreshold}}{<=\number\value{wordthreshold}}

\iflongquote{Some control-sequence: \LaTeX, plus a word makes 8}
{>\number\value{wordthreshold}}{<=\number\value{wordthreshold}}

\iflongquote{Some word and a footnote\footnote{This is a footnote. We
don't count words in footnotes.} and another word.}
{>\number\value{wordthreshold}}{<=\number\value{wordthreshold}}

\iflongquote{This is a quote.}{>\number\value{wordthreshold}}
{<=\number\value{wordthreshold}}

\iflongquote{This is a~quote.}{>\number\value{wordthreshold}}
{<=\number\value{wordthreshold}}

\iflongquote{This is a \emph{short quote}.}
{>\number\value{wordthreshold}}{<=\number\value{wordthreshold}}

\iflongquote{This is a~\emph {short~quote}.}
{>\number\value{wordthreshold}}{<=\number\value{wordthreshold}}

\iflongquote{This is a quote. With a low threshold, it will be typeset
as a separate paragraph.}{>\number\value{wordthreshold}}
{<=\number\value{wordthreshold}}

\iflongquote{This is a~quote. With a~low threshold, it will be typeset
as~a separate paragraph.}{>\number\value{wordthreshold}}
{<=\number\value{wordthreshold}}

\iflongquote{This is a somewhat \emph{longer quote}. With a low
threshold, it will be typeset as a separate paragraph.}
{>\number\value{wordthreshold}}{<=\number\value{wordthreshold}}

\iflongquote{This is~a somewhat \emph{longer quote}. With~a low
threshold, it will~be typeset as~a separate paragraph.}
{>\number\value{wordthreshold}}{<=\number\value{wordthreshold}}

\wordcountdebugfalse

\section*{Quoting}

\longquote[citation]{This is a quote.}

\longquote[citation][.]{This is a~\emph{short~quote}}

\longquote[citation]{This is a~quote. With a~low threshold, it will be
typeset as~a separate paragraph.}

\longquote[citation][.]{This is~a somewhat \emph{longer quote}. With~a

zappathustra

Sep 23, 2010, 12:17:46 PM
On 23 sep, 07:54, thomas.nospam.schm...@uni-bonn.de (Thomas A. Schmitz) wrote:
> Will Robertson <wsp...@gmail.com> writes:
> > [Diverging from the thread yet again.]   On 2010-09-23 05:11:39

Note that, if we're talking word count in a typeset paragraph, which I
suppose is the case here, we can use LuaTeX without any kind of regex:
provided we equate words with spaces (a word is followed by a
space), just count spaces in the pre_linebreak_filter:

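\directlua{%
% count glue (interword space) nodes; node.id("glue") avoids
% hard-coding the numeric id, which differs between LuaTeX versions
words = 0
word_count = function(head)
for space in node.traverse_id(node.id("glue"), head) do
words = words + 1
end
return head
end
callback.register("pre_linebreak_filter", word_count)
}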

That method has many drawbacks, which stem from the definition of a
word as followed by a space. But you can improve on it: add hyphens if
you think compound words should be counted as several words, add kerns
(except font kerns), etc. A more accurate (and simpler) method would
count glyphs.

[Note for Pandita: LuaTeX is merged with pdfTeX; it is actually an
extension.]

Best,
Paul

Philipp Lehman

Sep 23, 2010, 12:48:11 PM
zappathustra wrote:

> if we're talking word count in a typeset paragraph

I'm afraid we're talking about counting words prior to typesetting
because the word count will determine how to typeset the paragraph.

zappathustra

Sep 23, 2010, 1:13:29 PM
On 23 sep, 18:48, Philipp Lehman <devnull.1.leh...@spamgourmet.com>
wrote:

> zappathustra wrote:
> > if we're talking word count in a typeset paragraph
>
> I'm afraid we're talking about counting words prior to typesetting
> because the word count will determine how to typeset the paragraph.

Oh. But if that doesn't influence font (e.g. size change), then the
solution still works, since the pre_linebreak_filter is used before
the paragraph is built (it is a list of nodes).

Paul

Philipp Lehman

Sep 23, 2010, 1:19:41 PM
zappathustra wrote:

> Oh. But if that doesn't influence font (e.g. size change), then the
> solution still works, since the pre_linebreak_filter is used before
> the paragraph is built (it is a list of nodes).

Hmm, can you do that in a \vbox, too? That would be helpful. Note that
we get the text as an argument. We can feed it to a parser or put it
in a box.

\newcounter{wordthreshold}
\setcounter{wordthreshold}{40}
\ifgreaterthanthreshold{<text>}{<true>}{<false>}

zappathustra

Sep 23, 2010, 2:58:27 PM
On 23 sep, 19:19, Philipp Lehman <devnull.1.leh...@spamgourmet.com>
wrote:

> zappathustra wrote:
> > Oh. But if that doesn't influence font (e.g. size change), then the
> > solution still works, since the pre_linebreak_filter is used before
> > the paragraph is built (it is a list of nodes).
>
> Hhm, can you do that in a \vbox, too? That would be helpful. Note that
> we get the text as an argument. We can feed it to a parser or put it
> in a box.
>
> \newcounter{wordthreshold}
> \setcounter{wordthreshold}{40}
> \ifgreaterthanthreshold{<text>}{<true>}{<false>}


Now that I've understood better what's looked for, perhaps the
following would work?

%%%%%

\def\setwordthreshold#1{%
\directlua{wordthreshold = #1}%
}
\def\first#1#2{#1}
\def\second#1#2{#2}
\newbox\quotebox
\def\ifgreaterthanthreshold#1{%
\setbox\quotebox=\hbox{#1}%
\directlua{%
local words = 1
for space in node.traverse_id(node.id("glue"), tex.box[\the\quotebox].list) do
words = words + 1
end
if words < wordthreshold then
tex.print("\luaescapestring{\noexpand\first}")
else
tex.print("\luaescapestring{\noexpand\second}")
end}%
}

\def\Quote#1{%
\ifgreaterthanthreshold{#1}
{\unhbox\quotebox}
{\vskip\baselineskip\vbox{\unhbox\quotebox}\vskip\baselineskip}%
}
\setwordthreshold{6}


zero \Quote{one two three four five six} seven

zero \Quote{one two three four five} six

\bye

%%%%%

Various improvements could be made, of course, that's just a skeleton.
The "words" counter starts at 1, because the last word is probably not
followed by a space (so the argument should be trimmed).

I don't use callbacks, which avoids unnecessary complications, since
we can use a box.

Paul

Robert

Sep 23, 2010, 6:28:38 PM
On 23.09.10 19:19, Philipp Lehman wrote:
> \ifgreaterthanthreshold{<text>}{<true>}{<false>}

What about this: set the text twice with different interword spaces;
the difference of the boxes' widths, plus one, is the number of words.
So roughly:

\def\ifgreaterthanthreshold#1{%
\begingroup
\spaceskip =2sp
\xspaceskip=2sp
\setbox0\hbox{#1}%
\@tempdima=\wd0
\spaceskip =1sp
\xspaceskip=1sp
\setbox0\hbox{#1}%
\advance\@tempdima-\wd0
\advance\@tempdima 1sp
\ifnum\@tempdima>\c@wordthreshold
\aftergroup\@firstoftwo
\else
\aftergroup\@secondoftwo
\fi
\endgroup
}

(OK, a math group will be counted as one word.)
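
The arithmetic behind the trick, spelled out. The symbols are introduced here for illustration only: W is the total width of everything except interword glue, s is the number of interword glue items, and w_2 and w_1 are the box widths measured with 2sp and 1sp spaces.

```latex
\begin{align*}
  w_2 &= W + s \cdot 2\,\mathrm{sp}, \\
  w_1 &= W + s \cdot 1\,\mathrm{sp}, \\
  w_2 - w_1 &= s\,\mathrm{sp},
  \qquad \text{words} = s + 1 .
\end{align*}
```

Since \ifnum reads a dimen as an integer number of sp, the difference plus 1sp can be compared with the counter directly.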

Regards,
--
Robert

Will Robertson

Sep 23, 2010, 8:40:24 PM
On 2010-09-24 07:58:38 +0930, Robert <w....@gmx.net> said:

> On 23.09.10 19:19, Philipp Lehman wrote:
>> \ifgreaterthanthreshold{<text>}{<true>}{<false>}
>
> What about this: set the text twice with different interword spaces,
> the difference of the boxes' widths plus one is the number of words.

Cute. I like it; straightforward and deals with "invisible material"
sensibly. Although I guess things like \footnote{...} don't end up
being counted correctly. Depends on what you're looking for, I guess.

And of course you get the old problem of double-typesetting, which
really should have a proper wrapper so incrementors, etc., can turn
themselves off when necessary.

Anyway, a small addition: you want to adjust the meanings of \, \quad
etc. so they appear as a "space" for stretching/counting purposes (see
appended). I probably missed a few.

Cheers,
Will

\documentclass{article}
\begin{document}
\makeatletter
\def\countwords#1{%
\begingroup
\def\do##1{\let##1\space}%
\do\thinspace\do\quad\do\qquad
\do\enspace\do\enskip\do\hfill
\def\hspace##1{\space}%
\spaceskip =2sp %
\xspaceskip=2sp %
\setbox0\hbox{#1}%
\@tempdima=\wd0
\spaceskip =1sp %
\xspaceskip=1sp %
\setbox0\hbox{#1}%
\advance\@tempdima-\wd0
\advance\@tempdima 1sp %
\number\@tempdima
\endgroup
}
\countwords{foo bar baz}
\countwords{foo\space bar\space baz}
\countwords{foo~bar~baz}
\countwords{foo\nobreakspace bar\nobreakspace baz}
\countwords{foo\,bar\,baz}
\countwords{foo\quad bar\quad baz}
\countwords{foo\hspace{2pt}bar\hspace{2pt}baz}
\countwords{foo\hfill bar\hfill baz}
\end{document}

Donald Arseneau

Sep 23, 2010, 8:57:21 PM
Philipp Lehman <devnull....@spamgourmet.com> writes:

> I'm afraid we're talking about counting words prior to typesetting
> because the word count will determine how to typeset the paragraph.

But you can always pre-typeset the paragraph to measure, then
re-typeset it to spec.

I suggest


\setbox\@tempboxa\vbox{\parindent=0pt
\hsize=0pt
\hyphenpenalty=10000
\exhyphenpenalty=0
\hfuzz=\maxdimen
\def\par{ }% in case \par is in text
<paragraph text>
\endgraf
\global\numwords=\prevgraf
}


--
Donald Arseneau as...@triumf.ca

Philipp Lehman

Sep 24, 2010, 6:03:18 AM
Robert wrote:

> What about this: set the text twice with different interword spaces,
> the difference of the boxes' widths plus one is the number of words.

Neat idea!

> (OK, a math group will be counted as one word.)

Which seems like a reasonable way to deal with it.

Philipp Lehman

Sep 24, 2010, 9:39:16 AM
Donald Arseneau wrote:

> Philipp Lehman writes:
>> I'm afraid we're talking about counting words prior to typesetting
>> because the word count will determine how to typeset the paragraph.
>
> But you can always pre-typeset the paragraph to measure, then
> re-typeset it to spec.

Sure. That's in fact what I was trying to say.

> I suggest
>
> \setbox\@tempboxa\vbox{\parindent=0pt
> \hsize=0pt
> \hyphenpenalty=10000
> \exhyphenpenalty=0
> \hfuzz=\maxdimen
> \def\par{ }% in case \par is in text
> <paragraph text>
> \endgraf
> \global\numwords=\prevgraf
> }

Very nice. This seems to be the most robust solution suggested.
I'd merely add

\let~\space
\relpenalty\@M
\binoppenalty\@M
\let\allowbreak\relax
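
Put together as a complete document, this might look as follows. It is only a sketch: the wrapper name \countwordsbylines is invented, and \numwords has to be allocated with \newcount since the snippet above leaves that open. With \hsize=0pt every word ends up on its own line, so \prevgraf (the number of lines in the paragraph just ended) serves as the word count.

```latex
\documentclass{article}
\makeatletter
\newcount\numwords
% \countwordsbylines is a hypothetical wrapper name for this sketch.
\newcommand{\countwordsbylines}[1]{%
  \begingroup
  \setbox\@tempboxa\vbox{%
    \parindent=0pt
    \hsize=0pt
    \hyphenpenalty=10000
    \exhyphenpenalty=0
    \hfuzz=\maxdimen
    \let~\space
    \relpenalty\@M
    \binoppenalty\@M
    \let\allowbreak\relax
    \def\par{ }% in case \par is in text
    #1%
    \endgraf
    \global\numwords=\prevgraf
  }%
  \endgroup}
\makeatother
\begin{document}
\countwordsbylines{This is a~quote.}
The quote has \the\numwords\ words.
\end{document}
```

The trial box is thrown away; only the globally assigned \numwords survives the group.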

Philipp Lehman

Sep 24, 2010, 9:44:08 AM
Will Robertson wrote:

> Cute. I like it; straightforward and deals with "invisible material"
> sensibly. Although I guess things like \footnote{...} don't end up
> being counted correctly.

Footnotes will work fine. The code will see the width of the footnote
mark, which is the same in both cases.

> And of course you get the old problem of double-typesetting, which
> really should have a proper wrapper so incrementors, etc., can turn
> themselves off when necessary.

Sure, but grouping, \@fileswfalse, checkpointing LaTeX counters plus
redefining things like \index usually works well.
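
A rough sketch of what I mean, with invented names (\measurequietly, \measuredwidth, \mq@restore) and only the footnote counter checkpointed; a real implementation would have to save every counter the text may step.

```latex
\documentclass{article}
\makeatletter
\newdimen\measuredwidth
% \measurequietly, \measuredwidth and \mq@restore are hypothetical names.
\newcommand{\measurequietly}[1]{%
  \begingroup
  \@fileswfalse        % no writes to auxiliary files while measuring
  \let\index\@gobble   % drop index entries in the trial run
  % \setcounter acts globally, so remember the value and put it back
  \edef\mq@restore{\noexpand\setcounter{footnote}{\the\c@footnote}}%
  \setbox\@tempboxa\hbox{#1}%
  \mq@restore
  \global\measuredwidth=\wd\@tempboxa
  \endgroup}
\makeatother
\begin{document}
\measurequietly{Text\footnote{A note.} and more text.}
Width: \the\measuredwidth; footnote counter: \arabic{footnote}.
\end{document}
```

The group undoes the local changes (\@fileswfalse, \index), while the counter has to be restored by hand because LaTeX counter assignments are global.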

Will Robertson

Sep 26, 2010, 10:16:02 AM
On 2010-09-24 23:14:08 +0930, Philipp Lehman
<devnull....@spamgourmet.com> said:

> Will Robertson wrote:
>
>> Cute. I like it; straightforward and deals with "invisible material"
>> sensibly. Although I guess things like \footnote{...} don't end up
>> being counted correctly.
>
> Footnotes will work fine. The code will see the width of the footnote
> mark, which is the same in both cases.

Or do you want to count the words inside the footnote, too? In which
case, \renewcommand\footnote[2][]{#2}.

>> And of course you get the old problem of double-typesetting, which
>> really should have a proper wrapper so incrementors, etc., can turn
>> themselves off when necessary.
>
> Sure, but grouping, \@fileswfalse, checkpointing LaTeX counters plus
> redefining things like \index usually works well.

Yes, works well, but I'd like there to be a standard solution to the problem :)

Cheers,
Will
