Emacs Lisp: HTML Processing: Split Annotation (tutorial)

Xah Lee

Aug 17, 2011, 9:11:29 PM

new article

〈Emacs Lisp: HTML Processing: Split Annotation〉
http://xahlee.org/emacs/elisp_text_processing_split_annotation.html

Would you like to keep seeing my tutorials posted to comp.lang.lisp
and/or comp.emacs?
I haven't been posting every one. Perhaps just occasionally?

let me know.

plain text version follows
------------------------------------

Emacs Lisp: HTML Processing: Split Annotation

Xah Lee, 2011-08-16

This page shows an example of using emacs lisp to process text in HTML files.
The HTML files are classical novels. The annotation markups need to
change from one format into another. There are hundreds of such pages
that need to be processed.

-----------------------------
Problem

-----------------------------
Summary

For all HTML files in a directory, find any annotation markup
containing the bullet “•” symbol, such as
<div class="x-note">… • … • …</div>, and split the annotation into
multiple markups, one per bullet entry.
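
As a concrete before and after, here is a minimal sketch that works on a
string instead of a file. The function name my-split-xnote-string is
hypothetical. It assumes an x-note div never contains a nested div, and
that the bullets and any <br> in front of them are simply dropped. The
script given later in this article makes its own choices and works on
whole files.

```elisp
(defun my-split-xnote-string (html)
  "Return HTML with each bulleted x-note div split into one div per entry.
Hypothetical helper for illustration only.
Assumes no nested <div> inside an x-note div."
  (let ((tag-begin "<div class=\"x-note\">")
        (tag-end "</div>"))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (search-forward tag-begin nil t)
        (let* ((p1 (point))
               ;; p2 is the position just before the closing tag, or nil
               (p2 (and (search-forward tag-end nil t)
                        (match-beginning 0))))
          (when p2
            (let ((meat (buffer-substring-no-properties p1 p2)))
              (when (string-match-p "•" meat)
                ;; split on the bullet, eating an optional <br> and
                ;; surrounding spaces or newlines; drop empty pieces
                (let ((entries (split-string
                                meat "\\(?:<br>\\)?[ \n]*•[ \n]*" t)))
                  (delete-region p1 p2)
                  (goto-char p1)
                  (insert (mapconcat #'identity entries
                                     (concat tag-end "\n" tag-begin)))))))))
      (buffer-string))))
```

Calling it on "<div class=\"x-note\">• a ⇒ b<br>\n• c ⇒ d</div>" gives two
divs, one holding "a ⇒ b" and one holding "c ⇒ d", separated by a newline.
A div with no bullet is left unchanged.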

-----------------------------
Detail

If you are a contract web dev programmer, then you know that most
websites are a messy text soup. They are created by hundreds of tools
and languages: word processors, HTML generators, dozens of lightweight
markup languages, different frameworks in different languages (PHP,
Perl, Python, Ruby), from different web eras, by different programmers
in the past. Even emacs has several modes that generate HTML. The
markup is not in any consistent form. Often, it has missing tags too.

It is in these situations that emacs shines, because its powerful
embedded language, lisp, and its interactive nature let you maximize
automation: you work interactively while you are still feeling out the
pattern, then use a keyboard macro or emacs lisp for the parts that can
be automated.

For my website, i take the time to make sure that all my HTML is
consistent. But still, the pages were written over a span of 15 years.
Periodically i take the time to improve the markup, for example, when
a new version of CSS or HTML is widely adopted by web browsers (CSS1
to 2 to 3, HTML 3 to 4 to HTML5).

I have hundreds of pages of classic novels as HTML documents. These
documents contain annotations in special HTML markup. For example,
here's sample annotation from Titus Andronicus: Act 1:
• short ⇒ rudely brief. (AHD)
• sharp ⇒ Fierce, impetuous, hash, severe… (AHD)

SATURNINUS. 'Tis good, sir. You are very short with us;
But if we live we'll be as sharp with you.

Here's the raw HTML:

<div class="x-note">• short ⇒ rudely brief. (AHD)<br>
• sharp ⇒ Fierce, impetuous, hash, severe… (AHD)</div>

<pre class="tx">SATURNINUS. 'Tis good, sir. You are very <span class="xnt">short</span> with us;
But if we live we'll be as <span class="xnt">sharp</span> with you.
</pre>

Here's how the tags work. Each <span class="xnt"> marks up a word in
the main text. When a word is marked by “span.xnt”, that means it has a
sidebar annotation. The sidebar section is marked by
<div class="x-note">. Inside the “div.x-note”, there may be more than
one entry. Each entry starts with the bullet symbol “•”. For example,
in the above, the words “short” and “sharp” are both entries inside a
single “div.x-note” sidebar.
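
The relation between marked words and sidebar entries can be checked
mechanically. Below is a rough sketch, not part of the script in this
article. The two function names are hypothetical, and the second one
assumes every entry has the form “• word ⇒ definition” as in the sample
above, which may not hold for every page.

```elisp
(defun my-xnt-words (html)
  "Return the list of words marked by <span class=\"xnt\"> in HTML.
Hypothetical helper for illustration only."
  (let ((pos 0) (words '()))
    ;; allow a space or a newline between `span' and `class'
    (while (string-match "<span[ \n]+class=\"xnt\">\\([^<]*\\)</span>" html pos)
      (push (match-string 1 html) words)
      (setq pos (match-end 0)))
    (nreverse words)))

(defun my-xnote-entry-heads (html)
  "Return the head word of each bullet entry in HTML.
The head word is the text between a bullet • and the arrow ⇒.
Hypothetical helper; assumes entries look like: • word ⇒ definition"
  (let ((pos 0) (heads '()))
    (while (string-match "•[ \n]*\\([^⇒•<]*?\\)[ \n]*⇒" html pos)
      (push (match-string 1 html) heads)
      (setq pos (match-end 0)))
    (nreverse heads)))
```

If the two lists are equal for a page, every marked word has an entry
and the entries are in the same order as the words. If they differ,
that page is worth a look by hand.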

But recently, i decided it is better to have one entry per sidebar.
This makes the logic simpler, and makes it much easier to add
Javascript functionality, for example, clicking on a word in the main
text to highlight its annotation.

So, i want to write an elisp script to process all my files. If you
simply read the spec for this job, splitting a markup by a particular
character, you may think it's trivial and can be done in any language
in 10 minutes. Why then the elaborate discussion about the text soup
situation?

The important thing is that i DID NOT know what needed to be done to
begin with. Only after using emacs together with lisp scripts i wrote
before, to look at and check the existing markup in hundreds of files,
did i know what state the files were in and decide what i wanted to
do. Also, this change must be done with the ability to visually check
that all changes are done correctly, because the input may not be in
the format i expect (for example, it might be missing the bullet “•”).
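
For example, a quick check for annotations that are missing the bullet
could look like the sketch below. The function name
my-xnote-missing-bullet is hypothetical, and the code assumes an x-note
div never contains a nested div, so the first </div> after the opening
tag is taken as its end.

```elisp
(defun my-xnote-missing-bullet (html)
  "Return the inner text of every x-note div in HTML that has no bullet •.
Hypothetical checker for illustration only.
Assumes no nested <div> inside an x-note div."
  (let ((tag-begin "<div class=\"x-note\">")
        (result '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (search-forward tag-begin nil t)
        (let ((p1 (point)))
          (when (search-forward "</div>" nil t)
            (let ((meat (buffer-substring-no-properties
                         p1 (match-beginning 0))))
              ;; collect only the annotations without any bullet
              (unless (string-match-p "•" meat)
                (push meat result)))))))
    (nreverse result)))
```

Running this over the text of each file and printing any non-empty
result gives a list of places to inspect by hand before the real change
is made.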

For those Scheme Lisp academic computer science folks, you might
wonder why, when i started with these annotations, i didn't “design”
the markup well to begin with. The reason is that when i write a blog
article, or work on my literature annotation project, i really want to
focus on the writing first, the content, and get it done, rather than
get distracted by the CSS/HTML markup design. (One thing i do make sure
of is that whatever CSS/HTML i devise can be easily changed
systematically later by simple parsing.) I devote a significantly
larger share of my time to design than most people, but many factors
necessitate change. For example, you may not have known CSS as well
before, and thinking about HTML semantics is quite complex (see: 〈Are
You Intelligent Enough to Understand HTML5?〉). Browsers change,
standards change (e.g. HTML → XHTML → HTML5; see: 〈HTML5 Doctype,
Validation, X-UA-Compatible, and Why Do I Hate Hackers〉), thoughts on
best practices change, and my needs for the annotation have also
changed throughout the years.

-----------------------------
Solution

Here's the outline of steps:

1. Open the file. Search for the tag we want.
2. Check if the tag contains a bullet “•”.
3. If so, replace the bullet char with a new end tag and beginning tag,
   e.g. • ⇒ </div> <div>
4. Do this for all files in a dir (or a given list of files).

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-08-13
;; process all files in a dir.
;; split any markup like this:
;; <div class="x-note">… • … • …</div>
;; by the bullet •
;; into several x-note tags

(setq inputDir "~/web/xahlee_org/p/" )

;; add an ending slash if not there
(when (not (string= "/" (substring inputDir -1) ))
  (setq inputDir (concat inputDir "/") ) )

;; files to process
(setq fileList
      [
       "~/web/xahlee_org/p/arabian_nights/aladdin/aladdin4_1.html"
       "~/web/xahlee_org/p/arabian_nights/aladdin/aladdin3.html"
       ]
      )

(defun my-process-file-xnote (fpath)
  "process the file at fullpath FPATH …"
  (let (myBuffer (ξcounter 0) p1 p2 ξmeat
                 ξmeatNew
                 (changedItems '())
                 (tagBegin "<div class=\"x-note\">" )
                 (tagEnd "</div>" )
                 )

    (require 'sgml-mode)
    (when t

      (setq myBuffer (find-file fpath))
      (goto-char 1)
      (while (search-forward "<div class=\"x-note\">" nil t)

        ;; capture the x-note tag text
        (setq p1 (point))
        (backward-char 1)
        (sgml-skip-tag-forward 1)
        (backward-char 6) ; step back over the 6 chars of </div>
        (setq p2 (point))
        (setq ξmeat (buffer-substring-no-properties p1 p2))

        ;; if it contains a bullet
        (when (string-match "•" ξmeat)
          (setq ξcounter (1+ ξcounter))

          ;; clean the text. Remove some newline and <br> that's no longer needed
          (setq ξmeat (replace-regexp-in-string "\n*• *" "•" ξmeat t t ) )
          (setq ξmeat (replace-regexp-in-string "\n$" "" ξmeat t t ) ) ; delete ending eol
          (setq ξmeat (replace-regexp-in-string "<br>•" "•" ξmeat t t ) )

          ;; add the new entries to a list, for later reporting.
          ;; append, so entries from earlier x-note tags in the file are kept
          (setq changedItems (append changedItems (split-string ξmeat "•" t) ) )

          ;; break the bullet into new end/begin tags
          (setq ξmeatNew (replace-regexp-in-string "•" (concat tagEnd "\n" tagBegin) ξmeat t t ) )

          (goto-char p1)
          (delete-region p1 p2)
          (insert ξmeatNew)

          ;; remove the newline before end tag
          (when (looking-back "\n") (delete-backward-char 1))
          )
        )

      ;; report if any annotation in this file was split
      (when (not (= ξcounter 0))
        (princ "-------------------------------------------\n")
        (princ (format "%d %s\n\n" ξcounter fpath))

        (mapc (lambda (ξx) (princ (format "%s\n\n" ξx)) ) changedItems)
        )

      ;; close buffer if there's no change. Else leave it open.
      (when (not (buffer-modified-p myBuffer)) (kill-buffer myBuffer) )
      )
    ))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah x-note output*" )
  (with-output-to-temp-buffer outputBuffer
    ;; (mapc 'my-process-file-xnote fileList)
    (mapc 'my-process-file-xnote (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )
  )

Here's a sample output: elisp_text_processing_split_annotation.txt

I've put lots of comments in the code. It should be easy to understand.
If there's any part you don't understand, ask me. If you are new to
elisp, check out the first few sections of 〈Emacs Lisp Tutorial〉.

The weird ξ you see in my elisp code is the Greek letter xi. I use a
unicode char in variable names for experimental purposes. You can just
ignore it. (See: 〈Programing Style: Variable Naming: English Words
Considered Harmful〉.)

I ♥ emacs.

Xah
