Auto-smallcaps filter

Gwern Branwen

Feb 19, 2020, 3:14:55 PM
to pandoc-discuss
I wrote a plugin for my gwern.net Hakyll script
(https://www.gwern.net/hakyll.hs) which was slightly tricky, and so
might be of interest.

Bringhurst & other typographers recommend using small-caps for
acronyms/initials of 3 or more capital letters because with full
capitals, they look too big and dominate the page (eg Bringhurst 2004,
_Elements_ pg47; cf https://en.wikipedia.org/wiki/Small_caps#Uses
http://theworldsgreatestbook.com/book-design-part-5/
http://webtypography.net/3.2.2 )

This can be done by hand in Pandoc by using the span syntax like
`[ABC]{.smallcaps}`, but that quickly grows tedious. It can also be done
reasonably easily with a query-replace regexp, eg in Emacs
`(query-replace-regexp "\\([A-Z][A-Z][A-Z]+\\)" "[\\1]{.smallcaps}"
nil begin end)`, but each match must still be confirmed manually: while
almost all uses in regular text can be smallcaps-fied, a blind regexp
will wreck a ton of things like URLs & tooltips, code blocks, etc.

However, if we walk a Pandoc AST and check for acronyms/initials only
inside a `Str`, where they *can't* be part of a `Link` URL or a
`CodeBlock`, then looking over gwern.net ASTs, they seem to always be
safe to wrap in `SmallCaps` elements. Unfortunately, we can't use the
regular `Inline -> Inline` replacement pattern, because a single `Str`
turns into several elements (the text before, a `SmallCaps` whose
argument is itself an `[Inline]`, and the text after), so the rewrite
has to be `[Inline] -> [Inline]`.

So we instead walk the Pandoc AST, use a regexp to split on 3+ capital
letters, `SmallCaps` the matched text, and append recursively, and
return the concatenated results.

`bottomUp` is slower than `walk` but appears to be necessary here for
greedy generation; `walk` will do only *some* substitutions, which has
something to do with its tree traversal method, I think? (Regardless,
`smallcapsfy` doesn't seem to add *too* much overhead.)
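
One plausible reading of the `walk` vs `bottomUp` difference, sketched below under stated assumptions: a function that gives up on the rest of a list whenever the head is not a matching `Str` will miss later elements if the traversal calls it once per whole list, but not if the traversal also calls it on every tail of the list. The `Inline` type here is a cut-down stand-in for Pandoc's, and `headOnly`, `oncePerList`, and `everyTail` are hypothetical names that only model the two traversal styles; none of this is Pandoc's actual implementation.

```haskell
import Data.Char (isAsciiUpper)

-- Stand-in for Pandoc's Inline type (an assumption, to keep this base-only).
data Inline = Str String | Emph [Inline] | SmallCaps [Inline]
  deriving (Eq, Show)

-- Toy rewrite that only ever inspects the head of the list it is given.
headOnly :: [Inline] -> [Inline]
headOnly (Str s : rest)
  | length s >= 3 && all isAsciiUpper s = SmallCaps [Str s] : rest
headOnly xs = xs

-- Model of a traversal that calls the function once per whole list.
oncePerList :: ([Inline] -> [Inline]) -> [Inline] -> [Inline]
oncePerList f = f

-- Model of a traversal that calls the function on every tail, innermost
-- first (each tail of a list is itself a list of the same type).
everyTail :: ([Inline] -> [Inline]) -> [Inline] -> [Inline]
everyTail f = foldr (\x acc -> f (x : acc)) (f [])

example :: [Inline]
example = [Emph [Str "see"], Str "NSFW"]
```

With these definitions, `oncePerList headOnly example` returns the list unchanged because the head is an `Emph`, while `everyTail headOnly example` reaches the trailing `Str "NSFW"` and wraps it. Neither model descends into the `Emph`; that part of a real traversal is left out to keep the sketch short.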

The final code:

    import Text.Pandoc
    import Text.Regex.Posix ((=~))

    smallcapsfy :: [Inline] -> [Inline]
    smallcapsfy [Str []] = []
    -- Why `::String` on the regexp pattern? It needs to be specified,
    -- otherwise hakyll.hs's OverloadedStrings makes it ambiguous & a type error.
    smallcapsfy xs@(Str a : x) =
      let (before,matched,after) = a =~ ("[A-Z][A-Z][A-Z]+"::String) :: (String,String,String)
      in if matched==""
         then xs -- no acronym anywhere in `a`; the tail `x` is left as-is
         else [Str before, SmallCaps [Str matched]] ++ smallcapsfy [Str after] ++ smallcapsfy x
    smallcapsfy xs = xs

Regexp examples:

    "BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("Big","GAN","")
    "BigGANNN BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("Big","GANNN"," BigGAN")
    "NSFW BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("","NSFW"," BigGAN")
    "BigGANNN BigGAN" =~ "[A-Z][A-Z][A-Z]" :: (String,String,String)
    ~> ("Big","GAN","NN BigGAN")
    "biggan means big" =~ "[A-Z][A-Z][A-Z]" :: (String,String,String)
    ~> ("biggan means big","","")
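
For anyone who wants to reproduce the `+` triples above without regex-posix installed, here is a minimal regex-free sketch of the same split. `splitAcronym` is a hypothetical helper, not part of hakyll.hs, and it assumes `[A-Z]` means ASCII capitals only.

```haskell
import Data.Char (isAsciiUpper)

-- | Split at the first run of 3+ ASCII capitals: (before, match, after).
-- On no match the whole input comes back as `before`, like the regexp triple.
splitAcronym :: String -> (String, String, String)
splitAcronym = go ""
  where
    -- `acc` holds the already-scanned prefix in reverse order.
    go acc [] = (reverse acc, "", "")
    go acc rest@(c:cs)
      | length run >= 3 = (reverse acc, run, after)
      | otherwise       = go (c : acc) cs
      where (run, after) = span isAsciiUpper rest
```

For instance, `splitAcronym "BigGANNN BigGAN"` gives `("Big","GANNN"," BigGAN")`, matching the second example above.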

Function examples:

    smallcapsfy [Str "BigGAN"]
    ~> [Str "Big",SmallCaps [Str "GAN"]]
    smallcapsfy [Str "BigGANNN means big"]
    ~> [Str "Big",SmallCaps [Str "GANNN"],Str " means big"]
    smallcapsfy [Str "biggan means big"]
    ~> [Str "biggan means big"]

Whole-document examples:

    bottomUp smallcapsfy [Str "bigGAN means", Emph [Str "BIG"]]
    ~> [Str "big",SmallCaps [Str "GAN"],Str " means",Emph [Str "",SmallCaps [Str "BIG"]]]

--
gwern

John MacFarlane

Feb 20, 2020, 5:58:46 PM
to Gwern Branwen, pandoc-discuss

You could use this idiom instead of bottomUp:

walk (concatMap go)

Where 'go' is Inline -> [Inline], 'walk (concatMap go)' is
[Inline] -> [Inline]. This should perform better than
bottomUp.
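
A rough, base-only model of why this idiom is enough: once `go` handles a single element and `concatMap` lifts it over the list, every element gets rewritten in one call per list, so no per-tail traversal is needed. The `Inline` type and `walkInlines` below are cut-down stand-ins written for illustration (assumptions, not the pandoc-types definitions), and `go` uses a toy rule that only wraps a `Str` consisting entirely of 3+ capitals.

```haskell
import Data.Char (isAsciiUpper)

-- Stand-in for Pandoc's Inline type (an assumption, to keep this base-only).
data Inline = Str String | Emph [Inline] | SmallCaps [Inline]
  deriving (Eq, Show)

-- Element-wise rewrite: one Inline in, zero or more Inlines out.
go :: Inline -> [Inline]
go (Str "") = []
go (Str s)
  | length s >= 3 && all isAsciiUpper s = [SmallCaps [Str s]]
go x = [x]

-- Hand-rolled model of a walk over inline lists:
-- rewrite the children of each element first, then the list itself.
walkInlines :: ([Inline] -> [Inline]) -> [Inline] -> [Inline]
walkInlines f = f . map descend
  where
    descend (Emph xs)      = Emph (walkInlines f xs)
    descend (SmallCaps xs) = SmallCaps (walkInlines f xs)
    descend x              = x
```

Here `walkInlines (concatMap go) [Emph [Str "see", Str "BIG"], Str "", Str "NSFW"]` rewrites both the nested and the top-level acronym and drops the empty `Str`.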

Gwern Branwen

Feb 20, 2020, 11:12:51 PM
to pandoc-discuss
That seems to work, thanks.

--
gwern
https://www.gwern.net

Gwern Branwen

Apr 7, 2020, 11:13:05 AM
to pandoc-discuss
To update this: for HTML output, the code above is broken, because
smallcaps applied to capital letters is a null-op: the matched phrases
have to be lowercase before they can be rendered as small caps.

We *could* just rewrite the capitals to lowercase with `map toLower`
etc, but that breaks copy-paste: the underlying text for
'Big[GAN]{.smallcaps}' becomes 'Big[gan]{.smallcaps}', which copies as
'Biggan'. So instead of using native `SmallCaps` AST elements, we
create a new CSS class *just* for auto-detected all-caps runs,
'smallcaps-auto', separate from the pre-existing standard Pandoc
'smallcaps' class. We wrap the capitals in a `Span` with that class
rather than in `SmallCaps`, and then in CSS we do
`span.smallcaps-auto { font-feature-settings: 'smcp'; text-transform:
lowercase; }`: smallcaps is enabled for this class, but everything is
also lowercased at display time, thereby forcing the intended
smallcaps appearance while ensuring that copy-paste produces 'BigGAN'
(as written) instead of 'Biggan'.

Aside from the new CSS declaration specified above, `smallcapsfy` needs
to emit a `Span` rather than a `SmallCaps`, as follows:

    smallcapsfy :: [Inline] -> [Inline]
    smallcapsfy = concatMap go
      where
        go :: Inline -> [Inline]
        go (Str []) = []
        go x@(Str a) =
          let (before,matched,after) = a =~ ("[A-Z][A-Z][A-Z]+"::String) :: (String,String,String)
          in if matched==""
             then [x] -- no acronym anywhere in x
             else [Str before, Span ("", ["smallcaps-auto"], []) [Str matched]] ++ go (Str after)
        go x = [x]
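
The copy-paste property can be checked mechanically: flattening the rewritten inlines back to text should return the input unchanged. Below is a self-contained sketch of that check. It deliberately swaps the regexp recursion for a `groupBy`-based split (an alternative technique, not the code above), and `Inline`, `Attr`, `smallcapsAuto`, and `plainText` are stand-in or hypothetical names so that it runs with only `base`.

```haskell
import Data.Char (isAsciiUpper)
import Data.Function (on)
import Data.List (groupBy)

-- Stand-ins for Pandoc's Attr and Inline (assumptions; only two constructors).
type Attr = (String, [String], [(String, String)])

data Inline = Str String | Span Attr [Inline]
  deriving (Eq, Show)

-- Wrap every run of 3+ ASCII capitals in a "smallcaps-auto" Span,
-- leaving the text itself untouched.
smallcapsAuto :: String -> [Inline]
smallcapsAuto = mergeStrs . map wrap . groupBy ((==) `on` isAsciiUpper)
  where
    wrap run
      | length run >= 3 && all isAsciiUpper run =
          Span ("", ["smallcaps-auto"], []) [Str run]
      | otherwise = Str run
    -- Rejoin neighbouring plain-text pieces, e.g. "B" and "ig" into "Big".
    mergeStrs (Str a : Str b : rest) = mergeStrs (Str (a ++ b) : rest)
    mergeStrs (x : rest)             = x : mergeStrs rest
    mergeStrs []                     = []

-- Flatten back to the underlying text, roughly what a copy-paste would see.
plainText :: [Inline] -> String
plainText = concatMap text
  where
    text (Str s)     = s
    text (Span _ xs) = plainText xs
```

Unlike the regexp version, this variant never emits an empty `Str ""` before a leading acronym, and `plainText (smallcapsAuto s) == s` holds for any `s` because no character is changed or dropped.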

--
gwern
https://www.gwern.net