Script for `se titlecase` on all chapters at once


David

Nov 27, 2023, 4:42:40 PM
to Standard Ebooks
From time to time my projects have chapter titles from PG like: "A BRAVE RESCUE AND A ROUGH RIDE". Of course, we want: "A Brave Rescue and a Rough Ride". If you have 50, 60, or more chapters in a book, applying `se titlecase -n` to all these strings gets a bit tedious.

With some help from a friendly (not SO) coding Q&A site, I've come up with a script that replaces all UPPERCASE chapter titles with the `se titlecase` versions in one go.

It makes a couple of assumptions about which line of the file, and which column of that line, the title sits on, but once you know those two details, it works in the blink (or two) of an eye. Running `se clean` first ensures that the line and column are consistent within a given project.

Anyone interested in checking / testing / improving can find it in this public gist. Do see the comment under the four lines of code for a bit of commentary, and a quick source of test data.

Hope this helps someone else, too!

David / Fife, UK

Vince

Nov 27, 2023, 9:59:31 PM
to Standard Ebooks
Interesting idea! It does make a lot of assumptions that should not be assumed, though. :) There are a myriad of ways that chapter headers can be formatted, and this is just one of them. It’s not as simple as just changing the numbers in the code, though; some books have different sets of elements in different parts of the book, and it’s usually the larger books where that occurs.

Better to just look for the thing that has an epub:type="title" (which could be a <p>, an <h#>, or a <span>), and then run it on what’s in that element. You don’t have to care what the element is; you should be able to just search for
epub:type="title">(.*?)</
And although subtitles are relatively rare, they do occur, so the same could be done for them.
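
For anyone who wants to try the search above, here is a minimal Python sketch of it. It is only an illustration under stated assumptions: `naive_titlecase` and `titlecase_titles` are hypothetical names, and `naive_titlecase` is a crude stand-in for the real `se titlecase` logic (it just lowercases a handful of small words), not the toolset's algorithm.

```python
import re

# Hypothetical stand-in for `se titlecase`; the real algorithm is far more nuanced.
SMALL_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "or", "the", "to"}


def naive_titlecase(text: str) -> str:
    """Capitalise every word except small words that are not first."""
    words = text.lower().split(" ")
    cased = []
    for index, word in enumerate(words):
        if index > 0 and word in SMALL_WORDS:
            cased.append(word)
        else:
            cased.append(word[:1].upper() + word[1:])
    return " ".join(cased)


# The search suggested above, with groups added so the surrounding markup is kept.
TITLE_PATTERN = re.compile(r'(epub:type="title">)(.*?)(</)')


def titlecase_titles(xhtml: str, titlecase=naive_titlecase) -> str:
    """Titlecase whatever sits between epub:type="title"> and the next </."""
    return TITLE_PATTERN.sub(
        lambda match: match.group(1) + titlecase(match.group(2)) + match.group(3),
        xhtml,
    )


sample = '<h2 epub:type="title">A BRAVE RESCUE AND A ROUGH RIDE</h2>'
print(titlecase_titles(sample))
```

Because the pattern stops at the first `</`, a title containing inline tags is only matched up to its first closing tag, so results would still need checking by eye.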

David

Nov 28, 2023, 4:08:17 AM
to Standard Ebooks
Helpful - thanks, Vince!

There are, as you point out, plenty of variations across projects. For the sorts of things I tackle, this seemed a reasonable approach.

I see the value of the `epub:type="title"` search ... but, being a slavish follower of the Step-by-Step guide, I tend to do the titlecasing with the typography work (step #8), so well before the semantics at step #13.

Well, this is very much a "FWIW" sort of thing, and an interesting scripting problem for a rank amateur (could you tell?) like me!

D.

David at Standard Ebooks

Nov 28, 2023, 4:57:05 AM
to Standard Ebooks
It seems to me that titlecasing could be incorporated into the se build-title function without too much difficulty (though I haven't looked recently at that code). Is there a reason it couldn't happen there, even if it were an optional parameter?
--
You received this message because you are subscribed to the Google Groups "Standard Ebooks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to standardebook...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/standardebooks/6b855b7a-12bf-4cc3-a024-fce7f0f40816n%40googlegroups.com.

David

Nov 28, 2023, 5:55:29 AM
to Standard Ebooks
And while I think about Vince's suggestion (do the titlecase post-semantics), I note another "gotcha": if the chapter title is like:

    <p epub:type="title"><i xml:lang="la">QUO WARRANTO</i>?</p>

Then `se titlecase` will produce "quo Warranto", not "Quo Warranto". So my rough-and-ready script speeds some things up, but it isn't foolproof by any means. The results would (obviously) need checking; a quick `grep` over the titles would list them quickly enough for that.

David / Fife, UK

Vince

Nov 28, 2023, 12:52:41 PM
to standar...@googlegroups.com
Ooh, that’s a great idea; it’s a wonder no one’s thought of it before.

Alex, would that be acceptable? If so, I’ll look into that.

Brian

Nov 28, 2023, 7:58:22 PM
to standar...@googlegroups.com
If you make it happen automatically as part of se build-title, then
it should be optional, or it should be possible to disable it, since
there are times when you don't want titlecasing.

(At the very least, it should skip titlecasing if the element in
question also has an xml:lang tag.)

Vince

Nov 28, 2023, 8:21:43 PM
to standar...@googlegroups.com
Yes, having an option was included in David's idea.

Alex Cabal

Nov 29, 2023, 12:49:13 PM
to standar...@googlegroups.com
I think this would be more complicated than it might seem, because while
it's easy to *locate* the title based on heading rules, *updating* the
title is not so easy.

For example, you found the title:

<h2><abbr>mr.</abbr> smith reads <i>moby dick</i></h2>

It gets titlecased into the string "Mr. Smith Reads Moby Dick"

How do you reinsert that string back into the correct nested
tags?

Howard Cornett

Nov 29, 2023, 3:05:45 PM
to standar...@googlegroups.com
I'm not familiar with how that script works, but couldn't you just run everything between the <h2> tags through? Since you are currently removing the other tags, you should be able to capitalize the text words while leaving the tags untouched.

Howard



David

Nov 29, 2023, 3:21:28 PM
to Standard Ebooks
Actually, Alex's example gets titlecased by the se command this way:

    <h2><abbr>mr.</abbr> Smith Reads <i>moby Dick</i></h2>

which is exactly what I expected from my own (perhaps naive) script experiments.

I also have a version that looks for the `epub:type="title"` line and titlecases that (at the same gist linked in my original post).

Meanwhile, I'm heartened to know that, at least under certain (perhaps limited) conditions, the job of bringing some PG UPPERCASE titles to SE standards need not be so tedious as it has been. :)

David / Fife, UK

David at Standard Ebooks

Nov 29, 2023, 8:28:07 PM
to Standard Ebooks
Alex is correct, of course. But it all comes down to WHEN you run the titlecase function. If you do it early, before semanticate and before manually putting italics around a book title or whatever (and I can’t imagine anyone doing that without having first titlecased the title), then those problems won’t happen. 

You would have to check each time, though, if someone ran or re-ran the function late in the production process (perhaps just by checking for any tags in the target string).
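
A check like the one suggested above might look like this in Python. It is a rough sketch only: `contains_tags` is a hypothetical helper name, and the pattern simply looks for anything shaped like `<...>`, which is an assumption about what counts as a tag here.

```python
import re

# Anything shaped like <...> is treated as a tag; this is a deliberately loose test.
TAG_PATTERN = re.compile(r"<[^>]+>")


def contains_tags(title: str) -> bool:
    """Return True if the title string already contains inline markup."""
    return TAG_PATTERN.search(title) is not None


for title in (
    "A BRAVE RESCUE AND A ROUGH RIDE",
    "<abbr>mr.</abbr> smith reads <i>moby dick</i>",
):
    print(contains_tags(title), title)
```

A titlecasing step could use a test like this to decide whether a plain-string pass is safe or whether the title needs tag-aware handling.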

David

Nov 30, 2023, 3:53:13 AM
to Standard Ebooks
I'm puzzled, David and Alex, that you seem to be suggesting that "internal" tags get stripped by `se titlecase`. But ... they don't!

The example Alex gave was: `<h2><abbr>mr.</abbr> smith reads <i>moby dick</i></h2>`. (Surely never to be seen in a PG transcription, but ... for the sake of argument!) But `se titlecase` produces this string:

    <h2><abbr>mr.</abbr> Smith Reads <i>moby Dick</i></h2>

The issue isn't that the `abbr` and `i` tags are stripped (they aren't), but that `titlecase` doesn't catch the letter after the `>` character, in our example, the `m` in "Mr" and "Moby".

One way or t'other, though, I remain hopeful that scripting this (even in my naive ways, tweaked "per project") will save time and aggro.

FWIW!! David / Fife, UK


David at Standard Ebooks

Nov 30, 2023, 5:22:27 AM
to Standard Ebooks
David, yes, exactly. 

That’s the reason it’s not as simple as it first appears. Any titlecasing function (if it is to be able to be applied at any time) would need to read the title without the tags to do its work, and then the tags would have to be put back.

Alex’s example, though contrived, is certainly a common enough situation once the producer has run the semanticate function and manually applied italic tags.

My point is only that generally the time I want to apply titlecasing to all the titles is very early on, before I run semanticate and certainly before I semantically tag things like book names. Many Gutenberg transcriptions have subtitles of chapters in ALL CAPS which is why an automated titlecase process early on would be useful.

David

Nov 30, 2023, 6:07:07 AM
to Standard Ebooks
Okay, thanks David - I see I misunderstood what Alex was suggesting, which seemed to me to imply that `titlecase` itself removed tags.

Like you, I do this task (as per Step-by-step) well before semanticating. Encountering a few PG transcriptions (of 50, 76, and more chapters) with ALL CAPS TITLES inspired my pursuit of a scripted approach.

I have never in my life written even a scrap of Python, but it would seem possible that, on the analogy of the formatting of characters following an [em|en]-dash <https://github.com/standardebooks/tools/blob/ff249ef2e636ad281fd29698fda52ec8b0359e08/se/formatting.py#L1157>, so too a character following `>` could be uppercased. (And seeing that that function lowercases the string before doing its work makes sense [to me!] of the way Alex provided his example.)

Pretty complicated. :) But worth doing some headscratching over all the same, I reckon. Even with the tools in current shape, it is (IMO) very hopeful.

David / Fife, UK

Erin

Nov 30, 2023, 5:32:46 PM
to Standard Ebooks
Yes, a regex replacement that uppercases any lowercase letter after ‘>’ can easily be added, David. As you say, it’s exactly analogous to some other regex replacements already made in the titlecase function. This line does it:

text = regex.sub(r">([\p{Lowercase_Letter}])", lambda result: ">" + result.group(1).upper(), text)

With that, `se titlecase "<abbr>mr.</abbr> smith reads <i>moby dick</i>"` does give the desired result: `<abbr>Mr.</abbr> Smith Reads <i>Moby Dick</i>`.

There are obviously some words within tags whose first letters we wouldn't want uppercased: `etc.` for example. But since `etc.` is already manually lowercased in the titlecase function, the correct result could be achieved by making sure that the above replacement (uppercasing after `>`) occurs before that replacement (lowercasing `etc.`). The same applies to other special cases that are lowercased later on in titlecase; we'd need to make sure the replacement above occurred before those, so that if it uppercased a letter after `>` that we don't want uppercased, that would get fixed by the later special-case replacements in the function.

I'm not sure whether it's acceptable to make the result depend in this way on the order of the replacements, though.
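
To make the ordering point concrete, here is a small stand-alone sketch. It rests on assumptions: it uses the standard-library `re` module with `[^\W\d_]` (any letter) in place of the `regex` module's `\p{Lowercase_Letter}`, and `lowercase_special_cases` is a hypothetical stand-in that only handles `etc.`, not the real list of special cases in the titlecase function.

```python
import re


def uppercase_after_tag(text: str) -> str:
    """Uppercase any letter that directly follows a '>' character."""
    return re.sub(r">([^\W\d_])", lambda match: ">" + match.group(1).upper(), text)


def lowercase_special_cases(text: str) -> str:
    """Hypothetical stand-in for the later special-case fixes; handles only 'etc.'."""
    return re.sub(r"\bEtc\.", "etc.", text)


title = "<abbr>mr.</abbr> Smith, <abbr>etc.</abbr>"

# Uppercase after '>' first, then let the special cases undo unwanted capitals.
good_order = lowercase_special_cases(uppercase_after_tag(title))
# Reversed order: the special case runs too early and 'Etc.' survives.
bad_order = uppercase_after_tag(lowercase_special_cases(title))

print(good_order)  # <abbr>Mr.</abbr> Smith, <abbr>etc.</abbr>
print(bad_order)   # <abbr>Mr.</abbr> Smith, <abbr>Etc.</abbr>
```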

There are also probably other cases I haven't thought of that this approach doesn’t cover. Getting the tagless string, titlecasing it, then adding back the tags seems cleaner and would ensure the desired result in all cases. So if this functionality is wanted in the tools it's probably worth looking into that approach that David and Alex mentioned. Still, it seems like the above would be a simple way to cover many cases.
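
As a rough illustration of the tagless approach (not anyone's actual implementation), the sketch below splits the markup into tags and text, titlecases the joined text, and then slices the result back into the original text positions. Assumptions: `naive_titlecase` is a hypothetical, much-simplified stand-in for `se titlecase`, and the titlecasing step must not change the length of the text, which the code checks.

```python
import re
from typing import Callable

SMALL_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "or", "the", "to"}


def naive_titlecase(text: str) -> str:
    """Hypothetical stand-in for `se titlecase`: small words stay lowercase unless first."""
    words = text.lower().split(" ")
    cased = []
    for index, word in enumerate(words):
        if index > 0 and word in SMALL_WORDS:
            cased.append(word)
        else:
            cased.append(word[:1].upper() + word[1:])
    return " ".join(cased)


# The capturing group makes re.split keep the tags: text lands on even indexes, tags on odd.
TAG_SPLIT = re.compile(r"(<[^>]+>)")


def titlecase_preserving_tags(markup: str, titlecase: Callable[[str], str] = naive_titlecase) -> str:
    parts = TAG_SPLIT.split(markup)
    plain = "".join(parts[0::2])  # the title with all tags removed
    cased = titlecase(plain)
    if len(cased) != len(plain):
        raise ValueError("titlecasing changed the text length; cannot realign tags")

    rebuilt = []
    position = 0
    for index, part in enumerate(parts):
        if index % 2 == 1:
            rebuilt.append(part)  # a tag: copy it back untouched
        else:
            rebuilt.append(cased[position:position + len(part)])
            position += len(part)
    return "".join(rebuilt)


print(titlecase_preserving_tags("<h2><abbr>mr.</abbr> smith reads <i>moby dick</i></h2>"))
```

Realigning by character offset is the simplest design that works here; it breaks down as soon as the titlecaser inserts or removes characters, which is why the length check is there.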

Alex Cabal

Nov 30, 2023, 9:20:29 PM
to standar...@googlegroups.com
That's not how our titlecasing algorithm works. It's much more nuanced
than just uppercasing every word. You can inspect the source to see how
complex it really gets.

Erin

Nov 30, 2023, 10:55:34 PM
to Standard Ebooks
I wasn't suggesting that the algorithm consisted in uppercasing every word. I didn't think that, as I was looking at the source, at least if this is it. That is why I mentioned how "etc." is handled. David's suggestion seemed to work at least for some cases. But I must be missing something, sorry.

Erin

Dec 1, 2023, 3:15:12 AM
to Standard Ebooks
OK, I can see the mistaken assumption in what I said. Sorry about that.

David at Standard Ebooks

Dec 1, 2023, 8:56:48 PM
to Standard Ebooks
I was hoping to do this as a pull request for the SE toolkit, but I'm having trouble installing that in editable form.

Anyway, an approach like this might work:

https://github.com/drgrigg/tagged_titlecaser

Input: <h2 epub:type="title"><abbr>MR.</abbr> DARCY WAS READING <i epub:type="se:name.publication.book">MOBY-DICK</i></h2>
Output: <h2 epub:type="title"><abbr>Mr.</abbr> Darcy Was Reading <i epub:type="se:name.publication.book">Moby-Dick</i></h2>

David at Standard Ebooks

Dec 2, 2023, 5:20:06 AM
to Standard Ebooks
I've now got the SE toolkit installed in editable form, so if you think the approach I've suggested in my stand-alone repository is worth adopting, I'll have a go at a pull request.

Alex Cabal

Dec 3, 2023, 12:46:36 AM
to standar...@googlegroups.com
Can you send a list of test cases that explore some of the more nuanced
titlecasing problems?

David at Standard Ebooks

Dec 3, 2023, 7:29:17 PM
to standar...@googlegroups.com
I've uploaded to my repository a bunch of likely problem cases drawn from the corpus, with the results of running my routine on them (note that it now calls the actual SE titlecase routine by shelling out to it).

There are certainly some problems. A common one is where the title of a book or a poem has been tagged with <i> and semantics, because my routine at present just gloms it all together with the rest of the title and so words like 'the' don't get correctly titlecased, e.g.:

WRITTEN ON THE BLANK SPACE AT THE END OF CHAUCER’S TALE OF <i epub:type="se:name.publication.poem">THE FLOURE AND THE LEFE</i>

becomes:

Written on the Blank Space at the End of Chaucer’s Tale of <i epub:type="se:name.publication.poem">the Floure and the Lefe</i>

We'd have to look out for such and treat the bit in italics as a separate titlecasing exercise. We would also have to worry about whether to titlecase titles all in a foreign language.
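
One possible way to treat the italicised part as its own titlecasing exercise is sketched below. This is only an illustration under assumptions: `naive_titlecase` is a hypothetical stand-in for the real SE routine, and the pattern assumes that an inner title is an `<i>` element whose `epub:type` contains `se:name.`, as in the example above.

```python
import re

SMALL_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "or", "the", "to"}


def naive_titlecase(text: str) -> str:
    """Hypothetical stand-in for `se titlecase`: small words stay lowercase unless first."""
    words = text.lower().split(" ")
    cased = []
    for index, word in enumerate(words):
        if index > 0 and word in SMALL_WORDS:
            cased.append(word)
        else:
            cased.append(word[:1].upper() + word[1:])
    return " ".join(cased)


# Assumption: an inner title is an <i> whose epub:type mentions an se:name.* value.
INNER_TITLE = re.compile(r'(<i\b[^>]*epub:type="[^"]*se:name\.[^"]*"[^>]*>)(.*?)(</i>)')


def titlecase_with_inner_titles(markup: str) -> str:
    pieces = []
    last_end = 0
    for match in INNER_TITLE.finditer(markup):
        # Text outside the inner title is titlecased on its own...
        pieces.append(naive_titlecase(markup[last_end:match.start()]))
        # ...and the inner title is titlecased separately, so its first word is capitalised.
        pieces.append(match.group(1) + naive_titlecase(match.group(2)) + match.group(3))
        last_end = match.end()
    pieces.append(naive_titlecase(markup[last_end:]))
    return "".join(pieces)


sample = (
    "WRITTEN ON THE BLANK SPACE AT THE END OF CHAUCER’S TALE OF "
    '<i epub:type="se:name.publication.poem">THE FLOURE AND THE LEFE</i>'
)
print(titlecase_with_inner_titles(sample))
```

Text that follows an inner title is treated as a fresh string here, so a small word directly after the closing `</i>` would be capitalised; a fuller version would need to handle that.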

So, as you'd expect, not as simple as I first thought!


Alex Cabal

Dec 4, 2023, 11:31:47 AM
to standar...@googlegroups.com
OK, looks good, great work David.

We can add this to `se build-title` as the `-t,--titlecase` option. The
option should write out the titlecased title to the heading element and
also the <title> element. The tool should output an error if
`--titlecase` is specified at the same time as the `--stdout` option.

Can you also add these test cases to the unit tests? There is a basic
titlecase test in the `test_simple_cmds.py` file, we should remove that
in favor of the more robust list of tests you wrote. We can still
include it as part of the simple_cmds test.
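
The option handling described above could be sketched with `argparse` roughly as follows. This is a stand-alone illustration, not the toolset's code: the program name `build-title-sketch`, the `targets` argument, and the `parse_args` helper are all hypothetical; only the `-t,--titlecase` and `--stdout` option names come from the description above.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="build-title-sketch")
    parser.add_argument("-t", "--titlecase", action="store_true",
                        help="titlecase the heading and the <title> element")
    parser.add_argument("--stdout", action="store_true",
                        help="print the title instead of writing it to the file")
    parser.add_argument("targets", nargs="+", help="XHTML files to process")
    return parser


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Writing to the heading makes no sense when output goes to stdout, so reject the pair.
    if args.titlecase and args.stdout:
        parser.error("--titlecase cannot be used together with --stdout")
    return args


print(parse_args(["--titlecase", "chapter-1.xhtml"]))
```

`parser.error` prints a usage message and exits with status 2, which is the conventional behaviour for a command-line usage error.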


David at Standard Ebooks

Dec 4, 2023, 7:40:43 PM
to standar...@googlegroups.com
Will do, but I'm still a fair while away from being ready to do that.

One question, because it was thrown up by my test cases. Just looking at the plain command-line invocation of se titlecase, if the input is:

"Dr. H. to James Harlowe"   (in one of the letters in Clarissa)

se titlecase changes it to:

"Dr. H. To James Harlowe"  (that is, it capitalises the word 'to')

which is incorrect, yes? Caused by the period after Dr. H.  I think this is just something which would be handled by a lint ignore in practice, not something I need to worry about?

David at Standard Ebooks

Dec 4, 2023, 7:57:23 PM
to standar...@googlegroups.com
Similar sort of issue with:

"Written⁠—in⁠—Red" 

which gets changed by se titlecase to:

"Written⁠—In⁠—Red". (that is, it capitalises the word 'in' presumably because of the dashes)

Alex Cabal

Dec 5, 2023, 11:07:53 AM
to standar...@googlegroups.com
We just want the output to be the same as titlecase's regular output. It
can't always do the right thing because it can only understand some
basic rules of thumb, not the actual grammar of the sentence.


David at Standard Ebooks

Dec 5, 2023, 5:24:35 PM
to standar...@googlegroups.com
Great, thanks. I’m getting close to code I’m happy with in my stand-alone repository, so I’ll clean it up and move on to looking at integrating it into the toolset.

David at Standard Ebooks

Dec 14, 2023, 10:31:50 PM
to standar...@googlegroups.com
I'm going to have to ask for some help with this. 

I've introduced my tagged titlecasing code to formatting.py, but I've hit a stone wall when it comes to working out how to write the titlecased string back to the heading tags in the source file.

The generate_title function in formatting.py spends a lot of effort finding titles and subtitles but then strips out all tags, noterefs, etc to supply something which will work in the <title> tag of the file. Whereas I need to have it in its original form. I can figure that out (using .inner_xml) but I can't work out how to get hold of the nodes I need to write the result back to. I tried a couple of things which I've since removed.

Any suggestions appreciated (I admit my knowledge of xpath is very rudimentary).

Here's the branch I'm working on: https://github.com/standardebooks/tools/tree/add_titlecaser_to_build_title .

Erin

Dec 16, 2023, 4:38:00 AM
to Standard Ebooks
As this is an option for build-title, I assume you're only interested in writing to the node/s used to generate the text content of the <title> element. If that's the case, then it seems you could consider the two main cases from generate_title to find those node/s: first, when there is an hgroup without any h# or header ancestor, in which case you want the children of that hgroup; and second, when there is no such hgroup, in which case you want the first h# element. If that would work, then couldn't the xpath from generate_title for those cases be used (perhaps modified slightly)? Or is this something you've already seen a problem with?

David at Standard Ebooks

Dec 17, 2023, 6:04:25 PM
to Standard Ebooks
Thanks, Erin. I've tried that approach but am still struggling with it (as I say, my limited grasp of xpath is part of the problem). But I'm starting to think there's also a conceptual problem.

The more I think about this project, the less I think it should be part of se build-title. The latter command already strips out anything which can't go into an HTML <title> tag, and while it's probably useful to have that string automatically titlecased via a command-line option, I think that's as far as it should go.

Consider a new project where the producer is faced with a PG transcript where there are potentially scores of headings in all caps. As well as the chapter titles and subtitles, these could also include the headings of sub-chapters and other headings which aren't used as the source of the <title> tag. Titlecasing these is currently a tedious manual exercise, one which I typically do in the very early days of production.

What do we think about modifying the se titlecase function in the following ways:
  1. It will still accept a string on the command line and in that case return a titlecased string on the command-line.
  2. Change it so that it will accept a string which incorporates HTML tags and still return a correct result on the command-line.
  3. Give it an option to work on a directory of files (obviously, intended to be an SE production directory).
  4. By default when in that mode, it will titlecase the contents of any h2/h3... etc tag it finds (excluding titlepage.xhtml, uncopyright.xhtml etc.)
  5. And/or it titlecases the content of any tag with an attribute of epub:type="title" or "subtitle" (excluding titlepage.xhtml, uncopyright.xhtml etc.)
Comment?

Erin

Dec 17, 2023, 7:06:24 PM
to Standard Ebooks
David, not that it’s my comment you need, but that sounds like a great idea to me. When trying out your code to see how we might target the right nodes, I was actually thinking that it would be more useful to titlecase all headings.* But I ended up assuming that you only wanted to titlecase the headings used to generate the <title> text, because otherwise, as you say, this doesn't seem like an entirely appropriate addition to build-title.

(* I thought this would be especially so in the case of single-file poetry collections. The correct <title> text doesn’t even come from one of the headings in these cases (build-title currently gives an incorrect result for them, as far as I’ve tried, which the producer has to fix manually), and build-title with the titlecase option would be almost useless for such files if it’s only titlecasing exactly one poem's heading, whichever poem it is.)

When I was looking at it, it seemed like it might be more straightforward to titlecase all the headings than to titlecase only the headings providing the <title> text, because you’d then just be finding all the h# elements, similarly to process_headings in se_epub_generate_toc.py, and also p elements within hgroups or headers. But unless I’m simply missing how to generalise it, it also looks like this may require considering special cases if it's going to work for books with less common structures. For Boethius, for example, I had to use the xpath expression "//hgroup/p | //h1 | //h2/span | //h3 | //h4/span" to get the right nodes for all the headings, but that expression is clearly not going to be right for the majority of books.

Whether or not that's a misunderstanding on my part, maybe it would be better anyway to identify the headings to titlecase by using the epub:type attribute, as you suggest in your fifth point. Then we could avoid titlecasing Roman numerals, which we don't want to do (se titlecase "VIII" -> Viii), but which would be done if we simply select h# elements regardless of epub:type.
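
Selecting by epub:type rather than by element name might look something like the sketch below, using only the standard library. Assumptions: `elements_to_titlecase` is a hypothetical helper, and the sample markup (in particular the `ordinal z3998:roman` value on the numeral heading) is my guess at typical SE structure rather than something taken from a real book.

```python
import xml.etree.ElementTree as ET
from typing import List

EPUB_NS = "http://www.idpf.org/2007/ops"
EPUB_TYPE = f"{{{EPUB_NS}}}type"  # ElementTree spells namespaced attributes as {uri}name


def elements_to_titlecase(xhtml: str) -> List[ET.Element]:
    """Return elements whose epub:type includes the token 'title' or 'subtitle'."""
    root = ET.fromstring(xhtml)
    selected = []
    for element in root.iter():
        tokens = element.get(EPUB_TYPE, "").split()
        if "title" in tokens or "subtitle" in tokens:
            selected.append(element)
    return selected


# Assumed sample markup: the Roman numeral heading carries no 'title' token.
sample = """<section xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <hgroup>
    <h2 epub:type="ordinal z3998:roman">VIII</h2>
    <p epub:type="title">A BRAVE RESCUE AND A ROUGH RIDE</p>
  </hgroup>
</section>"""

for element in elements_to_titlecase(sample):
    # Strip the namespace from the tag name for display.
    print(element.tag.split("}")[1], "".join(element.itertext()))
```

With this selection the `VIII` heading is never handed to the titlecaser, whatever element it happens to sit in.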

David at Standard Ebooks

Dec 18, 2023, 12:17:25 AM
to Standard Ebooks
Thanks again, Erin, for your thoughts.

Unless someone else disagrees strongly, I think I'll proceed along the lines I suggested with an extension to the capabilities of se titlecase.

Alex Cabal

Dec 18, 2023, 12:34:48 PM
to standar...@googlegroups.com
I'm not sure about this because the way the tools are named, the
`build-*` tools operate directly on book directories, while other tools
generally don't. We'd need an argument to pass a directory, which is
also unusual for the toolset. I think we're trying to shoehorn new
functionality in for the sake of it... if this belongs anywhere it's in
`build-title`.



On 12/17/23 5:04 PM, David at Standard Ebooks wrote:
> Thanks, Erin. I've tried that approach but am still struggling with it
> (as I say, my limited grasp of xpath is part of the problem). But I'm
> starting to think there's also a conceptual problem.
>
> The more I think about this project, the less I think it should be part
> of se build-title. The latter command already strips out anything which
> can't go into a HTML <title> tag, and while it's probably useful to have
> that string automatically titlecased via a command-line option, I think
> that's as far as it should go.
>
> Consider a new project where the producer is faced with a PG transcript
> where there are potentially scores of headings in all caps. As well as
> the chapter titles and subtitles, these could also include the headings
> of sub-chapters and other headings which aren't used as the source of
> the <title> tag. Titlecasing these is currently a tedious manual
> exercise, one which I typically do in the very early days of production.
>
> What do we think about modifying the* se titlecase* function in the
> following ways:
>
> 1. It will still accept a string on the command line and in that case
> return a titlecased string on the command-line.
> 2. Change it so that it will accept a string which incorporates HTML
> tags and still return a correct result on the command-line
> 3. Give it an option to work on a directory of files (obviously,
> intended to be an SE production directory).
> 4. By default when in that mode, it will titlecase the contents of any
> h2/h3... etc tag it finds (excluding titlepage.xhtml,
> uncopyright.xhtml etc.)
> 5. And/or it titlecases the content of any tag with an attribute of
>> can figure /that/ out (using .inner_xml) but I can't work out how
>> to get hold of the nodes I need to write the result back to. I
>> tried a couple of things which I've since removed.
>>
>> Any suggestions appreciated (I admit my knowledge of xpath is very
>> rudimentary).
>>
>> Here's the branch I'm working on:
>> https://github.com/standardebooks/tools/tree/add_titlecaser_to_build_title
>>
>> On 6 Dec 2023 at 9:24 AM +1100, David at Standard Ebooks
>> <standar...@thegriggs.org>, wrote:
>>
>> Great, thanks. I’m getting close to code I’m happy with in my
>> stand-alone repository, so I’ll clean it up and move on to
>> looking at integrating it into the toolset.
>> On 6 Dec 2023 at 3:07 AM +1100, Alex Cabal
>> <al...@standardebooks.org>, wrote:
>>
>> We just want the output to be the same as titlecase's
>> regular output. It
>> can't always do the right thing because it can only
>> understand some
>> basic rules of thumb, not the actual grammar of the sentence.

David at Standard Ebooks

Dec 18, 2023, 4:05:28 PM
to standar...@googlegroups.com
OK, then it’s going to take someone smarter than me to integrate it into build-title, if indeed it is even useful there, which I’m starting to doubt.

I might just build a casual stand-alone tool along the lines I suggested, which I can use for my own purposes and make available for anyone else who wants it. Then I don’t need to be too fussed about edge cases, etc. 

Cheers, David / Melbourne, Australia 

David

Dec 18, 2023, 4:44:27 PM
to Standard Ebooks
I know that feeling, David. :/ And a "casual stand-alone tool" was exactly what I had in mind in my original post in this thread. And, in spite of its limitations (taking especially Vince's observations on board), my little script will probably serve me well enough next time the need arises.

David / Fife, UK

David at Standard Ebooks

Dec 19, 2023, 7:07:06 PM
to Standard Ebooks
Well, such as it is, here's a little helper program which should do the trick:

https://github.com/drgrigg/tagged_titlecaser

It shells out to se titlecase, so the SE tools need to be installed for this to work.
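
For anyone curious what shelling out can look like, here is a minimal Python sketch. It is not the code in the repository; `titlecase_via_cli` is a made-up name, and the default command assumes the `se titlecase -n` invocation mentioned at the start of this thread is on the PATH.

```python
import subprocess
from typing import Sequence


def titlecase_via_cli(
    text: str,
    command: Sequence[str] = ("se", "titlecase", "-n"),  # assumed invocation
) -> str:
    """Pass `text` to an external titlecasing command and return what it prints."""
    result = subprocess.run(
        [*command, text],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode output as str rather than bytes
        check=True,           # raise CalledProcessError if the command fails
    )
    # Drop a trailing newline in case the command adds one.
    return result.stdout.rstrip("\n")
```

Keeping the command as a parameter means the function can be tried out with any stand-in program when the SE tools aren't installed.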

Let me know if you see any problems.

David / Melbourne, Australia 

Erin

Dec 20, 2023, 4:52:40 PM
to Standard Ebooks
Thanks for this, David. It’s going to be helpful, I’m sure.

A couple of things from trying it out:

1. There appears to be a typo (one extra "<") here: the final `</h<{match.group(1)}>` needs to be `</h{match.group(1)}>`. Otherwise `</h#>` becomes `</h<#>` in some cases, such as in the closing tag of this h3 heading.

2. That heading is caught by the else clause that you commented "didn't find an epub:type, so just look for h2, h3, etc tags", even though it has epub:type="title". The reason seems to be that it also has an xml:lang attribute, so it doesn't exactly match the regex in your variable title_pattern. Maybe it would be worth changing the regex so that it matches such cases too.

3. Spaces are removed from before tags, e.g. `<h2 epub:type="title">The Eve of <abbr>St.</abbr> Agnes</h2>` becomes `<h2 epub:type="title">The Eve of<abbr>St.</abbr> Agnes</h2>`.

4. The part of the string that comes after a tag is ignored, e.g. `<h2 epub:type="title">The Eve Of <abbr>St.</abbr> AGNES</h2>` remains unchanged.

Are you getting these last two results as well?
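
One way the last two cases could be handled, as a rough Python sketch and not a description of how the script works: split the heading's contents on tags, titlecase every text run, and put back each run's own leading and trailing whitespace. `titlecase_around_tags` is a made-up name, and `str.title` is only a stand-in for the real titlecaser (it won't lowercase "Of"). Titlecasing each run separately can also differ from titlecasing the whole heading at once, since the titlecaser sees less context.

```python
import re
from typing import Callable

# The capturing group makes re.split keep the tags in the result list.
TAG_SPLIT_RE = re.compile(r"(<[^>]+>)")


def titlecase_around_tags(body: str, titlecase: Callable[[str], str]) -> str:
    """Titlecase the text runs in `body`, leaving tags and spacing untouched."""
    pieces = []
    for part in TAG_SPLIT_RE.split(body):
        if part.startswith("<") or not part.strip():
            # Tags and whitespace-only runs pass through unchanged.
            pieces.append(part)
            continue
        lead = part[: len(part) - len(part.lstrip())]  # leading whitespace
        trail = part[len(part.rstrip()):]              # trailing whitespace
        pieces.append(lead + titlecase(part.strip()) + trail)
    return "".join(pieces)


# The space before <abbr> survives, and the text after </abbr> is processed too.
print(titlecase_around_tags("The Eve Of <abbr>St.</abbr> AGNES", str.title))
```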

David at Standard Ebooks

Dec 20, 2023, 5:06:14 PM
to Standard Ebooks
Thanks, Erin, that’s great feedback. I obviously wasn’t looking closely enough at my test cases. I should be able to fix all those quickly, maybe later today (it’s breakfast time here in Melbourne).

David / Melbourne, Australia 

David at Standard Ebooks

Dec 20, 2023, 8:39:57 PM
to Standard Ebooks
Ha! I found the bug. And that's why I should be using xpath and not regex.

Back to the drawing board, I think.

David / Melbourne, Australia 

Erin

Dec 20, 2023, 9:31:24 PM
to Standard Ebooks
No problem, David. I'm glad if it helped. It's good to hear that you found the problem. Not that you likely need it, but let me know if a second pair of eyes on it would help.

"Today" makes sense for me too, being an hour behind Melbourne. :)

David at Standard Ebooks

Dec 21, 2023, 12:02:14 AM
to Standard Ebooks
Erin: 

I believe that I've fixed the problems you identified (and without trying to figure out how to do it with xpath).

David / Melbourne, Australia 

Erin

Dec 21, 2023, 2:58:12 PM
to Standard Ebooks
Thank you, David. Those problems don't arise anymore for the books I was testing the script on. I'll try to test it on some more in the coming days, but since you've no doubt already done that, I imagine it's in great shape.

One thing I was wondering about is the text within <span epub:type="label"> tags that are themselves within h# elements, such as in the examples here in the manual. Do you think it would be worth titlecasing those labels? If the transcription already contained them, they would be easy to fix with a text editor's search and replace function anyway, unlike headings, so it's understandable if not. But maybe it would be consistent with the checking already done in change_case for headings such as "CHAPTER X", where the label (and not the numeral) is titlecased.
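
As a rough illustration of the label idea, here is a small Python sketch; it is not taken from the script. `LABEL_RE` and `titlecase_labels` are made-up names, `str.title` again stands in for the real titlecaser, and for simplicity it doesn't check that the span sits inside an h# element. It rewrites only the text of label spans and leaves the ordinal span alone.

```python
import re
from typing import Callable

# Hypothetical pattern: a <span> whose epub:type is exactly "label".
LABEL_RE = re.compile(
    r'(<span\b[^>]*\bepub:type="label"[^>]*>)(.*?)(</span>)',
    re.DOTALL,
)


def titlecase_labels(xhtml: str, titlecase: Callable[[str], str]) -> str:
    """Titlecase the text inside label spans, keeping the tags as they are."""
    return LABEL_RE.sub(
        lambda m: m.group(1) + titlecase(m.group(2)) + m.group(3),
        xhtml,
    )


heading = (
    '<h2><span epub:type="label">CHAPTER</span> '
    '<span epub:type="ordinal z3998:roman">X</span></h2>'
)
# The label text is changed; the Roman numeral is not touched.
print(titlecase_labels(heading, str.title))
```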

David at Standard Ebooks

Dec 21, 2023, 4:58:33 PM
to Standard Ebooks
Yes, that’s a good thought about the labels, shouldn’t be hard for me to add that.

David / Melbourne, Australia 

David at Standard Ebooks

Dec 21, 2023, 7:59:04 PM
to Standard Ebooks
Done!

Plus fixed an embarrassing bug in the way I was compiling and using the regex patterns.

David / Melbourne, Australia 

David at Standard Ebooks

Dec 22, 2023, 10:57:51 PM
to Standard Ebooks
Here's the link to my repository again. Please feel free to use it if you think it would be useful, particularly in the early days of a project when you are cleaning up Gutenberg transcripts.

https://github.com/drgrigg/tagged_titlecaser

David / Melbourne, Australia 