[dev] reading an epub book with less: adventures in text processing


Greg Reagle

Mar 9, 2024, 10:32:01 AM
to d...@suckless.org
I have an epub ebook. It is a novel, but when I get this process working, I want to repeat it for any epub ebook.

I want to read it, with formatting (such as underline or italics), with less. I am happy to use any software that exists in the process, but I MUST use less in the end to read it. The terminal emulators that I use are usually st, xterm, and termux. All of them are capable of colored text and underlining and so forth, and I want to take advantage of this.

Pandoc does a very good job converting epub to html, and it looks good with w3m, however when I use w3m in a pipe, the output is truly *plain* text, meaning there are no escape codes for formatting. Same story with elinks. Is it possible to get either of these programs, or some other program, to dump html to text *with* escape codes?

Since I could not get HTML to work, I went with man format. Amazing. Pandoc automatically chooses man format for output based on the '.1' extension in the following:
pandoc --standalone -o City_of_Truth-Morrow.1 City_of_Truth-Morrow.epub
Remember to use standalone option or it won't work. Then
man --local-file --pager 'less -ir' City_of_Truth-Morrow.1
It looks great! (for text only on a terminal) It has bold and underlined text. From there I can use less 's' command to save the formatted text to a file.
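
The two steps above can be wrapped in one shell function. This is only a sketch: the name epubless is made up for illustration, and it assumes pandoc and a man that understands --local-file and --pager (as man-db does) are installed.

```shell
# epubless: page an epub through pandoc's man writer, man(1), and less.
# The function name is hypothetical; the options are the ones used above.
epubless() {
    if [ "$#" -ne 1 ]; then
        echo "usage: epubless BOOK.epub" >&2
        return 2
    fi
    # Derive the man-page name: book.epub -> book.1
    page="${1%.epub}.1"
    # --standalone asks pandoc for a complete man page, header included.
    pandoc --standalone -t man -o "$page" "$1" || return 1
    # --local-file makes man format the named file instead of searching
    # the manual path.
    man --local-file --pager 'less -ir' "$page"
}
```

Usage would be: epubless City_of_Truth-Morrow.epub. The generated .1 file stays next to the epub, so later viewings can skip the pandoc step.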

There might be a better or more direct way of achieving this goal, but this is what I figured out for now. And the rationale is this: I already know and love less. There is no good reason for me to learn the user interface of a different program like an epub reader or an html reader to read a book that does not have graphics, diagrams, pictures, and/or custom formatting.

Greg Reagle

Mar 9, 2024, 11:32:27 AM
to d...@suckless.org
On Sat, Mar 9, 2024, at 9:34 AM, Greg Reagle wrote:
> I want to read it, with formatting (such as underline or italics), with
> less.

Or, I would be satisfied with an ebook reader program (either TUI or GUI is fine) that has the same functionality and keys as less. Of course it can have some extra functionality and keys that would be useful specifically for reading or annotating an ebook, just not conflicting with less behavior.

I like the idea of a program that is backwards-compatible with less, in terms of functionality and keys. Maybe I will see if I can make a less-compatible branch for something like mupdf.

Hiltjo Posthuma

Mar 9, 2024, 11:34:19 AM
to dev mail list
Hi,

Maybe mupdf/mutools or the Ghostscript tools or qpdf?

--
Kind regards,
Hiltjo

Greg Minshall

Mar 9, 2024, 1:35:21 PM
to dev mail list
Greg,

thanks for this!

for some personal tastes/usage cases, this, using pandoc's `-t`
option, might be minor-ly simpler:
----
man --local-file --pager 'less -ir' \
<(pandoc --standalone -t man \
2015.31233.Arab-Geographers-Knowledge-Of-Southern-India.epub) | less
----

and, this deserves to be somewhere like fortune: "I already know and
love less.". :) maybe "fortune-mod-fles-pleh"? :)

cheers, Greg


Georg Lehner

Mar 9, 2024, 5:07:01 PM
to d...@suckless.org
Hi Greg,
Just modify your workflow slightly and you are good:

Option 1: use w3m

pandoc -s -t html City_of_Truth-Morrow.epub | w3m -T text/html

Option 2: use man/less

pandoc -t man City_of_Truth-Morrow.epub | man -l -

Option 3, save as html for future use:

pandoc -s  -o City_of_Truth-Morrow.html City_of_Truth-Morrow.epub

Saves your epub to html. Whenever you want to view it, use your favorite
browser, e.g. w3m, with all its features.

Option 4: save as man:

pandoc -s -t man -o City_of_Truth-Morrow.man City_of_Truth-Morrow.epub

Whenever you view it, use: man -l City_of_Truth-Morrow.man

- - -

Some notes:

The reason you lose formatting when saving from less(1) or w3m is that
these programs purposely do not save the terminal control characters
that do the markup. Line breaks and terminal control sequences are created
on demand, depending on the type and size of the terminal (window), and
would display differently (weirdly) on any terminal that differs from the
one they were saved from.
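
To make the "terminal control characters" concrete, here is a small sketch. The file names sample.txt and plain.txt are made up for illustration; the escape sequences are the standard SGR codes (1 = bold, 4 = underline, 0 = reset).

```shell
# Write one line containing SGR escape sequences.
printf '\033[1mbold\033[0m and \033[4munderlined\033[0m\n' > sample.txt

# Strip the sequences again to get truly plain text.
esc=$(printf '\033')
sed "s/${esc}\[[0-9;]*m//g" sample.txt > plain.txt

# View the formatted file with:  less -R sample.txt
# (-R passes colour/SGR sequences through to the terminal; without -R or
# -r, less prints the escape bytes visibly instead of interpreting them.)
```

The formatted file is just the plain text plus these byte sequences, which is why it only looks right on a terminal that interprets them.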

The -s (--standalone) option for Pandoc is not required for man
page output. For html (and other formats) pandoc outputs only the <body>
content; the -s option wraps this into a complete <html> document.

Best Regards,


  Georg


Greg Reagle

Mar 11, 2024, 11:02:44 AM
to d...@suckless.org
On Sat, Mar 9, 2024, at 11:33 AM, Hiltjo Posthuma wrote:
> Maybe mupdf/mutools or the Ghostscript tools or qpdf?

Yes, thank you for this excellent advice. I tried "mutool convert", but I am more satisfied with pandoc's output, for both text and html output (from epub).

Greg Reagle

Mar 11, 2024, 11:30:37 AM
to d...@suckless.org
On Sat, Mar 9, 2024, at 4:06 PM, Georg Lehner wrote:
> Option 1: use w3m
[snip]

All great commands. Thank you.

> The reason you lose formatting when saving from less(1) or w3m is that
> these programs purposely do not save the terminal control characters
> that do the markup. Line breaks and terminal control sequences are created
> on demand, depending on the type and size of the terminal (window), and
> would display differently (weirdly) on any terminal that differs from the
> one they were saved from.

Yes, I have noticed this. I would like to be able to tell programs to keep the formatting, but they decide automatically on their own to remove it. The automatic decision to keep or remove formatting based on terminal type is fine, but I find it very annoying that I cannot override this decision with many programs. GNU's ls is an exception (with the --color option). I would like to tell w3m or elinks to dump html and keep the formatting, which they cannot do (directly). There are workarounds, but they require extra steps.
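
The automatic decision described here usually comes down to a "is stdout a terminal?" test. A minimal sketch of that test with an override follows; the function name emphasize and the FORCE_FORMAT variable are invented for this illustration and do not belong to any of the programs mentioned.

```shell
# emphasize: print text underlined when stdout is a terminal, plain
# otherwise, unless the caller forces the decision.
# FORCE_FORMAT (hypothetical) may be "always", "never", or "auto".
emphasize() {
    if [ "${FORCE_FORMAT:-auto}" = always ] ||
       { [ "${FORCE_FORMAT:-auto}" = auto ] && [ -t 1 ]; }; then
        printf '\033[4m%s\033[0m\n' "$1"
    else
        printf '%s\n' "$1"
    fi
}
```

Piped into less, "emphasize hello" comes out plain because stdout is no longer a terminal; setting FORCE_FORMAT=always first keeps the underline. GNU ls offers the same kind of override: ls --color=always | less -R.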

> The -s (--standalone) option for Pandoc is not required for man
> page output.

Well it definitely is for me, meaning the version of Pandoc that I use: 2.17.1.1-2~deb12u1 amd64

Greg Reagle

Mar 11, 2024, 11:36:44 AM
to dev mail list
On Sat, Mar 9, 2024, at 12:53 PM, LM wrote:
> You could try modifying sdlbook or bard. It would be nice if either of these offered keymapping functionality like some programming editors do.

Thank you for telling me about these two programs. I had not heard of them.

https://github.com/rofl0r/SDLBook
https://github.com/festvox/bard

Greg Reagle

Mar 11, 2024, 12:01:47 PM
to d...@suckless.org
I think I finally figured it out! With help, of course, from my wise and helpful community. Thanks! And reading the man page for elinks. :>

for direct viewing in less:
pandoc -s -t html City_of_Truth-Morrow.epub | elinks -dump-color-mode 2 -force-html | less -ir

to make a file to keep, for repeated viewing in less:
pandoc -s -t html City_of_Truth-Morrow.epub | elinks -dump-color-mode 2 -force-html > City_of_Truth-Morrow-formatted.txt
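
For repeating this with any epub, the second command can be put in a function that derives the output name from the input name. A sketch only: the function name epub2ansi is made up here, and the pandoc and elinks options are copied unchanged from the commands above.

```shell
# epub2ansi: convert an epub to ANSI-formatted text and keep it in a file.
# Prints the name of the file it wrote (book.epub -> book-formatted.txt).
epub2ansi() {
    out="${1%.epub}-formatted.txt"
    pandoc -s -t html "$1" |
        elinks -dump-color-mode 2 -force-html > "$out" || return 1
    printf '%s\n' "$out"
}
```

Since the function prints the file name, the result can be opened directly: less -ir "$(epub2ansi City_of_Truth-Morrow.epub)".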

Now my next question is, what is the tool that does the *best* job of turning a PDF book into a readable text document? Via html or docbook or markdown or whatever--doesn't matter. My previous experience trying things out to achieve this goal is that it's just not worth it. The output always winds up un-readable.

Greg Reagle

Mar 11, 2024, 12:06:48 PM
to d...@suckless.org
On Sat, Mar 9, 2024, at 1:15 PM, Greg Minshall wrote:
> for some personal tastes/usage cases, this, using pandoc's `-t`
> option, might be minor-ly simpler:
> ----
> man --local-file --pager 'less -ir' \
> <(pandoc --standalone -t man \
> 2015.31233.Arab-Geographers-Knowledge-Of-Southern-India.epub) | less
> ----

Very cool command. Good idea to use process substitution. Here is another way of doing it:
pandoc --standalone -t man City_of_Truth-Morrow.epub | man /dev/stdin
but I don't know how portable /dev/stdin is.
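
On the portability question: /dev/stdin exists on Linux, the BSDs, and macOS, but POSIX does not require it. A hedged sketch of a fallback follows; the helper name man_stdin is hypothetical, and it assumes a man that accepts a file path as its argument, as in the command above.

```shell
# man_stdin: format a man page that arrives on standard input.
man_stdin() {
    if [ -e /dev/stdin ]; then
        man /dev/stdin
    else
        # No /dev/stdin on this system: spool to a temporary file instead.
        tmp=$(mktemp) || return 1
        cat > "$tmp"
        man "$tmp"
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
}
```

It would be used as: pandoc --standalone -t man City_of_Truth-Morrow.epub | man_stdin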

Κρακ Άουτ

Mar 11, 2024, 2:26:06 PM
to dev mail list
On 2024-03-11 17:44 Greg Reagle <li...@speedpost.net> wrote:

> Now my next question is, what is the tool that does the *best* job of
> turning a PDF book into a readable text document? Via html or
> docbook or markdown or whatever--doesn't matter. My previous
> experience trying things out to achieve this goal is that it's just
> not worth it. The output always winds up un-readable.

I use pdftotext from poppler-utils. It does quite a good job.

This is my main pdf reader command:
```
pdftotext -layout -nopgbrk ${1@Q} - | less -MS --use-color
```
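
The same pipeline as a standalone shell function might look like the sketch below. The name pdfless is made up; the ${1@Q} in the original presumably suits wherever that command is run from, while in a plain function a quoted "$1" is enough to handle file names with spaces.

```shell
# pdfless: view a PDF as text in less (hypothetical wrapper).
pdfless() {
    [ "$#" -eq 1 ] || { echo "usage: pdfless FILE.pdf" >&2; return 2; }
    # -layout keeps the physical page layout, -nopgbrk omits the form
    # feeds between pages, and "-" sends the text to standard output.
    pdftotext -layout -nopgbrk "$1" - | less -MS --use-color
}
```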


Viktor Grigorov

Mar 11, 2024, 2:58:38 PM
to dev mail list

Rather late to the party and I've already forgotten the initial email. Nevertheless, I'll give the program I most use: epub2txt.[0] It's not perfect, but compared to calibre's ebook-convert, and everything else I found in C on github or codeberg or gitlab, it's the best. A once-over with an editor capable of multiple selection and editing is the most I've had to do. Faulty output includes, say, only a single letter rather than a whole word capitalised or within '\e[...m' and '\e[0m'.

Protip: run it with -w 0 to get 'natural' paragraphs.


[0] https://github.com/kevinboone/epub2txt2


