East Asian languages and ruby text.

Qubyte

Jan 19, 2012, 10:59:11 AM
to pandoc-discuss
I'm interested in using pandoc to turn my markdown notes on Japanese
into nicely set HTML and (Xe)LaTeX. With HTML5, ruby (typically used
to give the phonetic reading of Chinese characters by placing text above
or to the side) is standard, and browser support is emerging (WebKit-based
browsers appear to fully support it). In browsers that
don't support it yet (notably Firefox) the feature falls back
nicely by placing the phonetic reading in brackets beside
each Chinese character, which is suitable for other output formats
too. As for (Xe)LaTeX, ruby is not an issue.

At the moment, I use inline HTML to achieve the result when the
conversion is to HTML, but it's ugly and uses a lot of keystrokes, for
example

<ruby>ご<rt></rt>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby>

sets ご飯 "gohan" with "han" spelt phonetically above the second
character, or to the right of it in brackets if the browser does not
support ruby.

I'd like to have something more like

r[はん](飯)

or any keystroke saving convention would be welcome.

I'm fairly certain that pandoc doesn't support ruby text, but please
correct me if I'm wrong. What are the chances of extending pandoc to
support ruby? Is this a huge task, or simply a matter of adding a
definition somewhere? If the latter is true, I can poke around and do
it myself. My apologies, I don't know the internals of pandoc so well.

fiddlosopher

Jan 20, 2012, 9:33:01 PM
to pandoc-discuss
On Jan 19, 7:59 am, Qubyte <mark.s.ever...@gmail.com> wrote:
> I'm fairly certain that pandoc doesn't support ruby text, but please
> correct me if I'm wrong.

You're right.

> What are the chances of extending pandoc to
> support ruby?

Very small I'd say. But read on.

> If the latter is true, I can poke around and do
> it myself. My apologies, I don't know the internals of pandoc so well.

You could try writing a script to support this, using the Pandoc API.
See the document "Scripting with pandoc" on the pandoc website. Your
application is a natural fit.

The syntax you suggested could be made to work, but I'd suggest
something like

[はん](-飯)

Then your script would consist mainly of a function

handleRuby :: Inline -> Inline
handleRuby (Link txt ('-':ruby,_)) = RawInline "html" rubyHtml
  where rubyHtml = -- (fill in this part yourself)
handleRuby x = x

I can help further if you get stuck.

John

Qubyte

Jan 22, 2012, 9:59:08 AM
to pandoc-...@googlegroups.com
Hi John,

Thank you for your quick reply, and apologies for the lateness of mine.

You've put me on the track I needed, so I'll read up and see what I can do.

Thanks again!

Mark

Qubyte

Jan 22, 2012, 12:34:32 PM
to pandoc-...@googlegroups.com
Right, here's what I have so far. There may be some daft mistakes as I'm new to Haskell.

-- handleRuby.hs
import Text.Pandoc

handleRuby :: Inline -> Inline
handleRuby (Link txt ('-':kanji,_)) = RawInline "html" rubyHtml
  where rubyHtml = ("<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ "hello" ++ "<rp>)</rp></rt></ruby>")
handleRuby x = x

readDoc :: String -> Pandoc
readDoc = readMarkdown defaultParserState

writeDoc :: Pandoc -> String
writeDoc = writeMarkdown defaultWriterOptions

main :: IO ()
main = interact (writeDoc . bottomUp handleRuby . readDoc)

Where it says hello, I'd like to put the content of the square brackets in "[ruby](-kanji)". How do I extract this?

I may be getting a little ahead now... Is there a way to hand options to this script? I'd like a single script to handle both LaTeX and HTML5 if possible.

Thanks,

Mark 

Michael

Jan 22, 2012, 4:06:43 PM
to pandoc-discuss

> Where it says hello, I'd like to put the content of the square brackets in
> "[ruby](-kanji)". How do I extract this?


In the first case you are considering for `handleRuby` -- i.e.

handleRuby (Link txt ('-':kanji,_)) = ...

-- it is inside the bit called `txt`:


$ echo "[ruby](-kanji)" | pandoc -r markdown -w native
[Para [Link [Str "ruby"] ("-kanji","")]]
$ echo "[はん](-飯)" | pandoc -r markdown -w native
[Para [Link [Str "\12399\12435"] ("-\39151","")]]

So in the first case the value of `txt` is `[Str "ruby"]`, and in the
second it's `[Str "\12399\12435"]`. That is, each is a list of pandoc
Inlines. But in the kind of case you are imagining, the list of
Inlines only has one element, always of the form `Str some_string`.

Written without 'sugar', the two inline lists are respectively `Str
"ruby" : []` and `Str "\12399\12435": []` -- each is just an
individual Inline prefixed to the empty list (of Inlines).

In any case, you need to expose the relevant pattern for the first
("ruby") bit of a `Link linked_txt string_pair` as well as the second
("kanji") bit, in order to give special handling to it.

Something like this, maybe:

handleRuby :: Inline -> Inline
handleRuby (Link (Str ruby : []) ('-' : kanji , _)) = RawInline "html" rubyHtml
  where rubyHtml = "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby
                   ++ "<rp>)</rp></rt></ruby>"
handleRuby x = x

Or maybe the "pattern matching" is clearer if we just separate out the
rubyHtml part as a function from two strings to a string:

handleRuby :: Inline -> Inline
handleRuby (Link (Str ruby : []) ('-' : kanji , _)) = RawInline "html" (rubyize ruby kanji)
handleRuby x = x

rubyize :: String -> String -> String
rubyize ruby kanji = "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby
                     ++ "<rp>)</rp></rt></ruby>"


So, if a pandoc Inline element fits exactly that pattern:

Link (Str ruby : []) ('-' : kanji , _)

for some strings ruby and kanji, it will be converted to the rubyized
raw html; otherwise it fits the second case and will be left alone.
This might need to be more sophisticated depending on what else might
happen in the brackets.
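
For anyone who wants to try the pattern match without installing pandoc, here is a minimal sketch. The `Inline` type is a hypothetical cut-down stand-in for pandoc's own type (just `Str`, `Link` and `RawInline`, shaped like the constructors discussed above), and this `rubyize` emits `<rt>` and `<rp>` as siblings, which is an assumption about the desired markup rather than anything pandoc prescribes.

```haskell
-- Toy stand-in for pandoc's Inline type (hypothetical, not the library type).
data Inline
  = Str String
  | Link [Inline] (String, String)   -- link text, (target, title)
  | RawInline String String          -- format, raw content
  deriving (Eq, Show)

-- Build the ruby markup from the reading and the base text.
rubyize :: String -> String -> String
rubyize ruby kanji =
  "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "</rt><rp>)</rp></ruby>"

-- Rewrite [reading](-kanji) links; leave every other inline untouched.
handleRuby :: Inline -> Inline
handleRuby (Link [Str ruby] ('-' : kanji, _)) = RawInline "html" (rubyize ruby kanji)
handleRuby x = x
```

Only a link whose text is a single `Str` and whose target starts with `-` is rewritten; an ordinary link, or one with several inlines in the brackets, falls through to the second equation.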

Michael Thompson

Jan 22, 2012, 4:33:01 PM
to pandoc-discuss
Sorry, the lines were a little long for communication by gmail;
http://hpaste.org/56861 will be more readable.

Qubyte

Jan 22, 2012, 9:48:44 PM
to pandoc-...@googlegroups.com
Hi Michael,

I figured it was to do with txt, but didn't know what to do with it. Your code works perfectly on test cases with Latin characters, but I'm having trouble with the Japanese text. For example

    echo "[goodbye](-hello)" | runhaskell handleRuby.hs

yields the expected output, and

    echo "[あした](-明日)" | runhaskell handleRuby.hs

just hands the original text back. Anticipating the obvious question on what encoding my terminal is using, I tried the same with files using 

    runhaskell handleRuby.hs < input.txt

and got the same result. Also the encoding of the terminal is set to UTF-8. Any ideas what I'm doing wrong?

Thanks for the code, this feels 99% done!

Mark

fiddlosopher

Jan 22, 2012, 10:15:07 PM
to pandoc-discuss
On Jan 22, 6:48 pm, Qubyte <mark.s.ever...@gmail.com> wrote:
> Hi Michael,
>
> I figured it was to do with txt, but didn't know what to do with it. Your
> code works perfectly on test cases with latin characters, but I'm having
> trouble with the Japanese text. For example
>
>     echo "[goodbye](-hello)" | runhaskell handleRuby.hs
>
> yields the expected output, and
>
>     echo "[あした](-明日)" | runhaskell handleRuby.hs
>
> just hands the original text back.

Both worked for me. Maybe you're using an older version of pandoc
than I am, which parses the symbols differently. In any case, you
might try the following:

handleRuby :: Inline -> Inline
handleRuby (Link txt ('-' : kanji , _)) = RawInline "html" rubyHtml
  where rubyHtml = "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++
                   stringify txt ++ "<rp>)</rp></rt></ruby>"
handleRuby x = x

stringify is in Text.Pandoc.Shared, so you should also add

import Text.Pandoc.Shared (stringify)

or, instead of 'stringify txt', you could do

writeHtmlString defaultWriterOptions (Pandoc (Meta [] [] []) [Plain txt])

which would also allow you to have formatting, etc. inside the
brackets.
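
As a rough picture of what stringify does with the bracket contents, here is a toy sketch. It is not pandoc's implementation: the `Inline` type is a hypothetical three-constructor stand-in, and `plainText` is a made-up name chosen to avoid confusion with the real function.

```haskell
-- Toy stand-in for a few of pandoc's inline constructors (hypothetical).
data Inline
  = Str String
  | Space
  | Emph [Inline]
  deriving (Eq, Show)

-- Flatten a list of inlines to plain text, discarding any formatting.
plainText :: [Inline] -> String
plainText = concatMap go
  where
    go (Str s)   = s
    go Space     = " "
    go (Emph xs) = plainText xs
```

This is why a stringify-based match accepts anything inside the brackets, whereas the `Str ruby : []` pattern only fires when the brackets hold a single unformatted string.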

A general suggestion: instead of writing a markdown -> markdown
converter, as you currently have, I would use the jsonFilter function
and write a converter that reads a json representation of a native
pandoc document, transforms it, and writes the same.

Code would differ only in main:

main = interact $ jsonFilter (bottomUp handleRuby)

You would then use it in converting markdown to HTML as follows:

ghc --make handleRuby # compile it for speed
pandoc -f markdown -t json input.txt | ./handleRuby | pandoc -f json -t html -o output.html

This allows you to specify command-line options as needed, and to
convert to different output formats.

John

Qubyte

Jan 22, 2012, 11:28:07 PM
to pandoc-...@googlegroups.com
I'll think about json a little later (one step at a time). I was on the macports pandoc, which is a little old (1.8.0.1). I installed the Haskell Platform and got pandoc via cabal to get me up to date.

Running the same script as before now generates the text:

    <ruby>%E6%98%8E%E6%97%A5<rp>(</rp><rt>あした<rp>)</rp></rt></ruby>

Swapping (Str ruby : []) for ruby and using stringify produces the same result. Any ideas what's behind the garbling?

Thanks again for the help, it's very much appreciated.

John MacFarlane

Jan 23, 2012, 12:27:12 AM
to pandoc-...@googlegroups.com
+++ Qubyte [Jan 22 12 20:28 ]:

> I'll think about json a little later (one step at a time). I was on the
> macports pandoc, which is a little old (1.8.0.1). I installed the
> Haskell Platform and got pandoc via cabal to get me up to date.
>
> Running the same script before generates the text:
>
>
> <ruby>%E6%98%8E%E6%97%A5<rp>(</rp><rt>あした<rp>)</rp></rt></ruby>

What version are you running now (pandoc --version)? It appears that the
version you're using URL-encodes the source part of the link; the dev
version doesn't do that. You can transform it back to a regular string
by using unEscapeString from Network.URI. ('cabal install network' if
you don't have it.)

Also, what is the output of 'locale', and what is the encoding of
your input text?
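
To make the URL-encoding concrete, here is a base-only sketch of percent-decoding followed by UTF-8 decoding. It is a hand-rolled illustration, not how Network.URI implements it; `percentDecode`, `escapedBytes`, `utf8Decode` and `safeChr` are names invented for this sketch, and malformed UTF-8 is not validated.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, digitToInt, isHexDigit)

-- Decode runs of %XX escapes as UTF-8; pass every other character through.
percentDecode :: String -> String
percentDecode s = case escapedBytes s of
  ([], c : rest) -> c : percentDecode rest
  ([], [])       -> []
  (bytes, rest)  -> utf8Decode bytes ++ percentDecode rest

-- Collect the leading run of %XX escapes as byte values.
escapedBytes :: String -> ([Int], String)
escapedBytes ('%' : a : b : rest)
  | isHexDigit a && isHexDigit b =
      let (more, rest') = escapedBytes rest
      in (digitToInt a * 16 + digitToInt b : more, rest')
escapedBytes s = ([], s)

-- Decode a UTF-8 byte sequence; stray or truncated bytes are not rejected.
utf8Decode :: [Int] -> String
utf8Decode [] = []
utf8Decode (b : bs)
  | b < 0x80  = chr b : utf8Decode bs
  | b >= 0xF0 = multi 3 (b .&. 0x07)
  | b >= 0xE0 = multi 2 (b .&. 0x0F)
  | b >= 0xC0 = multi 1 (b .&. 0x1F)
  | otherwise = chr b : utf8Decode bs
  where
    -- Combine the lead bits with the low six bits of n continuation bytes.
    multi n lead =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (c .&. 0x3F)) lead conts
      in safeChr cp : utf8Decode rest

-- Map out-of-range code points to U+FFFD instead of crashing.
safeChr :: Int -> Char
safeChr cp
  | cp > 0x10FFFF = '\xFFFD'
  | otherwise     = chr cp
```

With this, the target `%E6%98%8E%E6%97%A5` from the output above decodes back to 明日, and text without escapes is returned unchanged.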

Qubyte

Jan 23, 2012, 12:59:24 AM
to pandoc-...@googlegroups.com
I'm now running 1.8.2.1. Is the development version installable via cabal?

The output of locale is
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
(Confusingly I'm British so I have my system configured for en_GB primarily, but live in Japan.) 

I believe the encoding of input text is UTF-8. The terminal is configured so, and my text editor is set to use UTF-8 (and everything displays characters correctly).

I installed the network package with cabal, and altered the code to the following:

-- handleRuby.hs
import Text.Pandoc
import Network.URI (unEscapeString)

handleRuby :: Inline -> Inline
handleRuby (Link (Str ruby : []) ('-':kanji,_)) = RawInline "html" rubyHtml
  where rubyHtml = ("<ruby>" ++ unEscapeString kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "<rp>)</rp></rt></ruby>")
handleRuby x = x

readDoc :: String -> Pandoc
readDoc = readMarkdown defaultParserState

writeDoc :: Pandoc -> String
writeDoc = writeMarkdown defaultWriterOptions

main :: IO ()
main = interact (writeDoc . bottomUp handleRuby . readDoc)

However, this gives me a nasty looking error when I attempt to run it:

GHCi runtime linker: fatal error: I found a duplicate definition for symbol
   _hsnet_freeaddrinfo
whilst processing object file
   /Users/mark/Library/Haskell/ghc-7.0.4/lib/network-2.3.0.8/lib/HSnetwork-2.3.0.8.o
This could be caused by:
   * Loading two different object files which export the same symbol
   * Specifying the same object file twice on the GHCi command line
   * An incorrect `package.conf' entry, causing some object to be
     loaded twice.
GHCi cannot safely continue in this situation.  Exiting now.  Sorry.

This install of the Haskell platform is really new (i.e. I haven't poked around in conf files and it's only a few hours old) and the only two things I've asked it to install (I realise that these pull dependencies) are Pandoc and the network package. I'm not having much luck today!

Thanks again.

Qubyte

Jan 23, 2012, 1:48:11 AM
to pandoc-...@googlegroups.com
I got rid of the pandoc I had installed via cabal, and installed the development one as per the instructions. Everything appears to be working well! I'll move on to json this evening.

Thanks again!

Qubyte

Jan 25, 2012, 6:27:37 AM
to pandoc-...@googlegroups.com
I'm moving on to looking at json now. I've written two versions of the handleRuby function, where the second handles LaTeX. I'm guessing that a switch for the language needs to be handed down to handleRuby somehow, but Haskell is still somewhat mysterious to me. In fact I'm almost certain that my comprehension has been hobbled by a decade+ of coding C... I'm waiting for it to click. Is there some sample code somewhere for using json? I haven't had much luck finding any. Any tips on the below would be very much appreciated.

-- handleRuby.hs
import Text.Pandoc

-- handleRuby :: Inline -> Inline
-- handleRuby (Link (Str ruby : []) ('-':kanji,_)) = RawInline "html" rubyHtml
--   where rubyHtml = ("<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "</rt><rp>)</rp></ruby>")
-- handleRuby x = x

handleRuby :: Inline -> Inline
handleRuby (Link (Str ruby : []) ('-':kanji,_)) = RawInline "latex" rubyLaTeX
  where rubyLaTeX = ("\\ruby{" ++ kanji ++ "}{" ++ ruby ++ "}")
handleRuby x = x 

main :: IO ()
main = interact $ jsonFilter (bottomUp handleRuby)

Thanks again!

John MacFarlane

Jan 25, 2012, 1:25:59 PM
to pandoc-...@googlegroups.com
+++ Qubyte [Jan 25 12 03:27 ]:

> I'm moving on to looking at json now. I've written two versions of the
> handleRuby function, where the second handles LaTeX. I'm guessing that a
> switch for the language needs to be handed down to handleRuby somehow,
> but Haskell is still somewhat mysterious to me. In fact I'm almost
> certain that my comprehension has been hobbled with a decade+ of coding
> C... I'm waiting for it to click. Is there some sample code somewhere
> for using json? I haven't had much luck finding any. Any tips on the
> below would be very much appreciated.

Without any testing, here's a modified version of what you had
that will give you support for both html and latex:

-- handleRuby.hs
-- compile with: ghc --make handleRuby
-- run with:
-- pandoc -t json | ./handleRuby $FORMAT | pandoc -f json -t $FORMAT
-- where $FORMAT is either "html" or "latex"
import Text.Pandoc
import System.Environment (getArgs)

type Format = String

handleRuby :: Format -> Inline -> Inline
handleRuby "html" (Link (Str ruby : []) ('-':kanji,_)) = RawInline "html"
  $ "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "</rt><rp>)</rp></ruby>"
handleRuby "latex" (Link (Str ruby : []) ('-':kanji,_)) = RawInline "latex"
  $ "\\ruby{" ++ kanji ++ "}{" ++ ruby ++ "}"
handleRuby _ x = x

main :: IO ()
main = do
  args <- getArgs
  case args of
    [format] -> interact $ jsonFilter $ bottomUp (handleRuby format)
    _        -> error "Usage: handleRuby (html|latex)"

Qubyte

Jan 29, 2012, 3:35:45 AM
to pandoc-...@googlegroups.com
In the end I settled on the below, which is essentially your code. I added an additional couple of output options which just strip one of the ruby text or the kanji, but I used RawInline for the markdown output. Is this the proper way of doing it? It seems to perform the intended job. For these two options, essentially all that is happening is that the custom links get replaced with normal text and it becomes standard markdown again.

An example use is: pandoc -f markdown -t json input.markdown | ./handleRuby kanji | pandoc -f json -t markdown

-- handleRuby.hs
import Text.Pandoc
import System.Environment (getArgs)
 
handleRuby :: String -> Inline -> Inline
handleRuby "html" (Link (Str ruby : []) ('-':kanji,_)) = RawInline "html"
 $ "<ruby>" ++ kanji ++ "<rp>(</rp><rt>" ++ ruby ++ "</rt><rp>)</rp></ruby>"
handleRuby "latex" (Link (Str ruby : []) ('-':kanji,_)) = RawInline "latex"
 $ "\\ruby{" ++ kanji ++ "}{" ++ ruby ++ "}"
handleRuby "kanji" (Link txt ('-':kanji,_)) = RawInline "markdown"
 $ kanji
handleRuby "kana" (Link (Str ruby : []) ('-':kanji,_)) = RawInline "markdown"
 $ ruby
handleRuby _ x = x
 
main :: IO ()
main = do
 args <- getArgs
 case args of
   [format] -> interact $ jsonFilter $ bottomUp (handleRuby format)
   _        -> error "Usage:  handleRuby (html|latex|kanji|kana)"

Thanks again, this has been a really interesting experience!

John MacFarlane

Jan 29, 2012, 1:36:48 PM
to pandoc-...@googlegroups.com
+++ Qubyte [Jan 29 12 00:35 ]:

> In the end I settled on the below, which is essentially your code. I
> added an additional couple of output options which just strip one of
> the ruby text or the kanji, but I used RawInline for the markdown
> output. Is this the proper way of doing it? It seems to perform the
> intended job. For these two options, essentially all that is happening
> is that the custom links get replaced with normal text and it becomes
> standard markdown again.

For a bit more generality you could avoid using RawInline
for these -- that would allow you to do the stripping operations
for any output format.

handleRuby "kanji" (Link txt ('-':kanji,_)) = Str kanji
handleRuby "kana" (Link (Str ruby : []) ('-':kanji,_)) = Str ruby
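
A toy model of why `Str` is the more general choice, under the simplifying assumption that a writer passes a raw inline through only when its format tag matches the output format. Both the two-constructor `Inline` type and the `render` function are hypothetical stand-ins, not pandoc's writers.

```haskell
-- Toy stand-in for two of pandoc's inline constructors (hypothetical).
data Inline
  = Str String
  | RawInline String String   -- format, raw content
  deriving (Eq, Show)

-- Toy writer: plain strings always survive; raw inlines survive only
-- when their format tag matches the requested output format.
render :: String -> [Inline] -> String
render fmt = concatMap go
  where
    go (Str s) = s
    go (RawInline f s)
      | f == fmt  = s
      | otherwise = ""
```

In this model `Str "明日"` comes out under every format, while `RawInline "markdown" "明日"` disappears as soon as the target is anything other than markdown.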

Qubyte

Feb 14, 2012, 6:10:08 AM
to pandoc-...@googlegroups.com
Thanks again for all the tips. I've put the filter on GitHub:

https://github.com/qubyte/Japanese-Guide

It's in the scripts folder. There are some (pretty bad, I should put some work into them) notes in this project that it can be tried out on too.

HansBKK

Feb 14, 2012, 7:55:06 AM
to pandoc-...@googlegroups.com
Note that your statement in the readme about your package requiring the user to compile from dev may no longer be true now that 1.9 binaries are released for download (currently 1.9.1.1).

Qubyte

Feb 14, 2012, 7:56:30 AM
to pandoc-...@googlegroups.com
Nice! I'll update the readme. Thanks for the heads up.

John MacFarlane

Feb 14, 2012, 11:27:46 AM
to pandoc-...@googlegroups.com
In fact, I've included a version of the filter as an example
in the scripting guide:

http://johnmacfarlane.net/pandoc/scripting.html#a-filter-for-ruby-text

Hope you don't mind. It's a really nice example, I think.

John


Qubyte

Feb 14, 2012, 11:46:34 AM
to pandoc-...@googlegroups.com
Not at all! It looks rather fetching there.